We consider estimating the predictive density under Kullback-Leibler
loss in a high-dimensional Gaussian model. Decision theoretic
properties of the within-family prediction error, the minimal
risk among estimates in the class $\mathcal{G}$ of all Gaussian densities, are
discussed. We show that in sparse models, the class $\mathcal{G}$ is minimax
sub-optimal. We produce asymptotically sharp upper and lower bounds
on the within-family prediction errors for various subfamilies of $\mathcal{G}$.
Under mild regularity conditions, in the sub-family where the
covariance structure is represented by a single data dependent parameter
$\Sigma = d \cdot I$, the Kullback-Leibler risk has a tractable decomposition
which can be subsequently minimized to yield optimally flattened
predictive density estimates. The optimal predictive risk can be
explicitly expressed in terms of the corresponding mean square error of
the location estimate, and so, the role of shrinkage in the predictive
regime can be determined based on point estimation theory results.
Our results demonstrate that some of the decision theoretic parallels 
between predictive density estimation and point estimation regimes 
can be explained by second moment based concentration properties 
of the quadratic loss. 



1. Introduction and main result. 

1.1. Background. We consider a prediction set-up where the observed
past data X and the unobserved future data Y are generated from a joint
parametric density $f_\theta(x, y)$ where $\theta$ is the unknown parameter. A
perspective in prediction analysis is to use the concept of predictive
likelihoods (Hinkley, 1979, Lauritzen, 1974) and its variants (Bjørnstad,
1990), to infer about the future Y based on X, with $\theta$ playing the role of
a nuisance parameter. Most predictive likelihoods (Butler, 1986) are
functions of the future conditional density $f_\theta(y \mid X = x)$ which is also
referred to as the predictive density (Geisser, 1971). Efficient estimates
of the predictive density will ensure good predictive performances. Here,
we study the problem of predictive density estimation in a
high-dimensional Gaussian model.

We consider the multiple regression model analyzed in George and Xu
(2008). Suppose the observed past X is independently generated from an
$m_1$-dimensional product Gaussian density $N(A\theta, \sigma_p^2 I)$ indexed by an
n-dimensional unknown parameter $\theta$, known variance $\sigma_p^2$ and known
$m_1 \times n$ data-matrix A. The future Y is generated from the
$m_2$-dimensional Gaussian density $N(B\theta, \sigma_f^2 I)$ with the time-invariant
parameter $\theta$, known $m_2 \times n$ dimensional matrix B and known future
volatility $\sigma_f$.

Homoscedastic Gaussian Predictive Model

M.1  $X \sim N(A\theta, \sigma_p^2 I)$ and $Y \sim N(B\theta, \sigma_f^2 I)$

The location structure depends on the time-invariant unknown vector $\theta$ of
length n. If $\theta$ is fixed, the true predictive density of Y would be
$p_{(\theta,B)}(\cdot) = N(B\theta, \sigma_f^2 I)$. We would like to estimate it by density
estimates $\hat p(\cdot \mid X = x)$.

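We use the information theoretic measure of Kullback and Leibler (1951)
as the goodness of fit measure between the true and estimated distributions:

$L(\theta, \hat p(\cdot \mid x)) = \int p_{(\theta,B)}(y) \log\left(\frac{p_{(\theta,B)}(y)}{\hat p(y \mid x)}\right) dy.$

Averaging over the past observations X, the predictive risk of the density
estimate $\hat p(\cdot \mid X = x)$ at $\theta$ is given by

(1) $\rho(\theta, \hat p) = \int\int p_{(\theta,A)}(x)\, p_{(\theta,B)}(y) \log\left(\frac{p_{(\theta,B)}(y)}{\hat p(y \mid x)}\right) dy\, dx.$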
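
For Gaussian densities the inner Kullback-Leibler integral has a closed form, which may help make the loss concrete. The following is a minimal Python sketch and not part of the original analysis: the function name `kl_isotropic_gaussians` and its argument names are illustrative assumptions, and both covariances are assumed to be scalar multiples of the identity.

```python
import math


def kl_isotropic_gaussians(mu_true, var_true, mu_est, var_est):
    """KL( N(mu_true, var_true * I) || N(mu_est, var_est * I) ) in closed form.

    Both densities are n-dimensional with covariance equal to a scalar
    times the identity (an assumption of this sketch).
    """
    n = len(mu_true)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_true, mu_est))
    ratio = var_true / var_est
    # Scale mismatch term plus location mismatch term.
    return 0.5 * (n * (ratio - 1.0 - math.log(ratio)) + sq_dist / var_est)
```

Writing the estimated variance as c times the true variance r, the value above equals (n/2)(log c + 1/c - 1) plus the squared location error divided by 2cr, so the loss separates into a scale part and a location part.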

The relative entropy predictive risk $\rho(\theta, \hat p)$ measures the exponential rate
of divergence of the joint likelihood ratio over a large number of independent 
trials (Larimore, 1983). The minimal predictive risk estimate maximizes the 
expected growth rate in repeated investment scenarios (Cover and Thomas, 
1991, Chapters 6 and 15). Competitive optimal predictive schemes (Bell and Cover,
1980) for gambling, sports betting, portfolio selection, etc. can be constructed
from predictive density estimates with optimal Kullback-Leibler (KL) risk 
properties. Our Gaussian predictive framework can accommodate a fairly 
large number of prediction scenarios as often, in high-dimensional models, 
good normalization transformations of the data are available (Efron, 2010,
Chapter 1, Page 8). In the data compression set-up $L(\theta, \hat p(\cdot \mid x))$
reflects the excess average code length that we need in Gaussian channels if
we use the conditional density estimate $\hat p$ instead of the true density to
construct a uniquely decodable code for the data Y given the past x
(McMillan, 1956). The notion can be extended to a sequential framework where
minimizing the predictive risk would result in the minimum description length
(Barron, Rissanen and Yu, 1998, Rissanen, 1984) based estimate of the true 
parametric density (Liang and Barron, 2005). 

Here, we discuss efficient estimators in the class $\mathcal{G}_n$ of all n-dimensional
Gaussian distributions with positive definite (p.d.) covariances as the
dimension increases, i.e.,

$\mathcal{G}_n = \{ g : g = N(\mu, \Sigma) \text{ where } \mu \in \mathbb{R}^n \text{ and } \Sigma \text{ is p.d.} \}.$

For any class $\mathcal{C}$ we define the predictive risk of the class as

$\rho_{\mathcal{C}}(\theta) = \inf_{\hat p \in \mathcal{C}} \rho(\theta, \hat p).$

As the true parametric density is also Gaussian, $\rho_{\mathcal{G}}(\theta)$ represents the
within-family predictive risk. We also evaluate the predictive risk of the
sub-family $\mathcal{G}[p]$ which contains all product Gaussian densities. We also make
inferences in sparsity restricted parameter spaces. We impose an $\ell_0$
constraint on the parameter space:

(2) $\Theta(n, s) = \{ \theta \in \mathbb{R}^n : \sum_{i=1}^n I[\theta_i \neq 0] \le s \}.$
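
As a small illustration of the constraint in (2), membership in the sparse parameter space can be checked by counting non-zero coordinates. This is only a sketch; the helper names `l0_norm` and `in_sparse_space` are hypothetical and do not appear in the paper.

```python
def l0_norm(theta):
    """Number of non-zero coordinates of the parameter vector."""
    return sum(1 for value in theta if value != 0)


def in_sparse_space(theta, s):
    """True when theta has at most s non-zero coordinates."""
    return l0_norm(theta) <= s
```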

This notion of sparsity is widely used in modeling highly interactive
systems (represented by a large number of related parameters) which are
dominated by only a few significant effects. Sparse models have been
successfully employed in biological sciences (Tibshirani et al., 2002),
engineering applications (Donoho, 2006) and financial modeling (Brodie et
al., 2009). The predictive model M.1 with $\ell_0$ constraint on the location
structure can be used for sparse coding and for prediction in sparse
networks.

As in point estimation, risk calculations in M.1 would intrinsically depend
on risk calculations in the orthogonal model:

Orthogonal Gaussian Predictive Model 



M.2  $X \sim N(\theta, \sigma_p^2 I)$ and $Y \sim N(\theta, \sigma_f^2 I)$

where X and Y are both n-dimensional vectors. Most of our calculations
will be in high dimensions (which means $n \to \infty$ in the orthogonal
model) though dimension independent bounds will also be provided. As
$n \to \infty$, M.2 represents the Gaussian sequence model (Nussbaum, 1996) and
has been widely studied in the function estimation framework (Johnstone,
2012). Estimation in M.1 can be linked with the decision theoretic results in
M.2 through the procedure outlined in Donoho, Johnstone and Montanari
(2011).

Our Contributions. Efficacy of predictive density estimates has been a
subject of considerable interest in predictive inference. Aitchison (1975),
Aslan (2006), Hartigan (1998), Komaki (1996) determined asymptotically
optimal (admissible) Bayes predictive density estimates in fixed dimensional
parametric families whereas minimax optimality in restricted parameter spaces
has been discussed in Fourdrinier et al. (2011) and Kubokawa et al. (2012).

Recently, Brown, George and Xu (2008), George, Liang and Xu (2006), Ghosh, Mergel and Datta 
(2008) extended the admissibility results to high dimensional Gaussian mod- 
els. However, the optimal estimates are not necessarily Gaussian and using 
them in high-dimensional problems would involve computationally inten- 
sive methods. Here, we find optimal predictive density estimates within the 
Gaussian family and also compute their predictive risk. It is computationally 
easier to construct predictive attributes based on our optimal Gaussian pre- 
dictive density estimates and the optimal Gaussian predictive risk assures 
guaranteed performances of our strategies. 

Minimizing the Gaussian predictive risk involves simultaneous estimation 
of the location and scale parameters. The issue of joint estimation of loca- 
tion and scale (and to a degree the shape) has not been addressed before 
in one sample Gaussian models. However, separate estimation of location 
(Tibshirani, 2011) and covariance (Friedman, Hastie and Tibshirani, 2008) 
are well-studied topics in constrained Gaussian estimation. Also, as reviewed 
in George, Liang and Xu (2012) decision theoretic parallels exist between 
point estimation theory under quadratic loss and predictive density
estimation under Kullback-Leibler loss in high-dimensional Gaussian models. Here,
our results demonstrate that some of these decision theoretic parallels (in the 
class G) can be explained by second moment based concentration properties 
on the quadratic loss of location point estimators in high dimensions. The 
moment based approach used here for estimating the scale parameter bears 
resemblance to concepts seen elsewhere in prediction theory, particularly in
the theory of cross validation (Yang, 2007) and covariance penalties for
model selection (Efron, 2004, Ye, 1998).

1.2. Description of the main results. We describe results of two different
flavors concerning (i) the minimax risk of the class $\mathcal{G}$ of Gaussian density
estimates in sparsity restricted spaces (ii) the optimality (asymptotic
admissibility, oracle inequality, risk upper bounds) of estimates in class $\mathcal{G}$
in unrestricted space (over $\mathbb{R}^n$ as $n \to \infty$) where the role of shrinkage
comes into play. In order to describe the results, we need to introduce the
following notation.

Notation and Preliminaries. As some of our results are dimension
dependent, henceforth we refrain from using bold representation for vectors
and denote the dimension in the subscript. Given any fixed sequence $\theta_\infty$
we represent the first n values by the n-dimensional vector $\theta_n$ whereas
$\theta(n)$ denotes the nth value, i.e., $\theta_{n+1} = (\theta_n, \theta(n+1))$. By $\mathcal{G}_n[p]$ we
denote the class of all n-dimensional product Gaussian densities

$\mathcal{G}_n[p] = \{ g[\hat\mu_n, D_n] : \hat\mu_n \in \mathbb{R}^n, D_n \text{ is any } n \times n \text{ p.d. diagonal matrix} \}$

where $g[\hat\mu_n, D_n]$ is a normal density with mean $\hat\mu_n$ and diagonal
covariance $\sigma_f^2 D_n$. We represent the minimal Gaussian predictive risk by
$\rho_{\mathcal{G}}(\theta_n) := \inf_{\hat p \in \mathcal{G}_n} \rho(\theta_n, \hat p)$.

Our shrinkage results will mostly refer to the sub-family $\mathcal{G}_n[1]$ of $\mathcal{G}_n[p]$.
$\mathcal{G}_n[1]$ contains Gaussian densities with only one data-adaptive scale estimate

$\mathcal{G}_n[1] = \{ g[\hat\mu_n, c] : \hat\mu_n \in \mathbb{R}^n \text{ and } c \in \mathbb{R}^+ \}$

where $g[\hat\mu_n, c]$ denotes a normal density with mean $\hat\mu_n$ and covariance
$c\,\sigma_f^2 I$.

A typical density estimate in $\mathcal{G}_n[1]$ is represented as $g[\hat\theta_n, \hat c(n)]$ where
$\hat\theta_n$ is a location estimate and $\hat c(n)$ is the scale estimate based on
observing an n-dimensional past observation $X_n$. For any fixed location
estimate $\hat\theta_n$, the optimal risk of density estimates in $\mathcal{G}_n[1]$ centered
around $\hat\theta_n$ is given by

$\rho_0(\theta_n, \hat\theta_n) = \inf_{\hat c(X_n) \in \mathbb{R}^+} \rho(\theta_n, g[\hat\theta_n, \hat c(X_n)]).$

The quadratic risk of the location estimate is denoted by

$q(\theta_n, \hat\theta_n) = E_{\theta_n}\|\hat\theta(X_n) - \theta_n\|^2$

where the expectation is over the observed past $X_n$. Later, we show that if
the value of $q(\theta_n, \hat\theta_n)$ were known, then the optimal choice for scale is

$IF_{\theta_n}(\hat\theta_n) = 1 + n^{-1} r^{-1} q(\theta_n, \hat\theta_n)$

which will be called the Ideal Flattening coefficient for $\hat\theta_n$ at $\theta_n$. Here,
given a location estimate $\hat\theta_n$ we construct suitable estimates $\hat c(n)$ of the
scale such that asymptotically when $n \to \infty$ the density estimate
$g[\hat\theta_n, \hat c(n)]$ is optimally flattened in the sense that
$\rho(\theta_n, g[\hat\theta_n, \hat c(n)]) - \rho_0(\theta_n, \hat\theta_n) \le O(1)$.
However, for proving optimality of the flattening coefficient we need the
following mild regularity conditions on the location estimate $\hat\theta_n$:

(3) $q(\theta_n, \hat\theta_n) \le O(n).$

(4) $\mathrm{Var}_{\theta_n}(\|\hat\theta_n - \theta_n\|^2) \le O(n)$

and the existence of a suitable estimate $U[\hat\theta](X_n)$ for the quadratic risk of
$\hat\theta_n$ at $\theta_n$ with the following properties:

(5) $|E_{\theta_n}(U_n) - q(\theta_n, \hat\theta_n)| \le O(n^{1/2}).$

(6) $\mathrm{Var}_{\theta_n}(U_n) \le O(n).$

(7) $\mathrm{Var}_{\theta_n}\{(1 + (nr)^{-1} U_n)^{-1}\} \le O(n^{-1}).$

These properties are fairly mild and in Section 2 we show that most popular
point estimators obey these conditions. We call these conditions
Reasonable Asymptotic Square Loss (RASL) properties and the set of point
estimators in the sequence model (where the action set is $\mathbb{R}^\infty$) which
satisfies these conditions is denoted by $\mathcal{A}$. Also, we denote the ratio of the
future to past variances by $r := \sigma_f^2/\sigma_p^2$. Our results will depend on r. For
sequences, the symbol $a_n \sim b_n$ means $a_n = b_n(1 + o(1))$ and $a_n \asymp b_n$ means
$a_n/b_n \in (k_1, k_2)$ where $k_1$ and $k_2$ are constants.

Results. We show that in high dimensions, the minimum predictive entropy
risk of Gaussian density estimates around a reasonable location estimate $\hat\theta_n$
can be expressed in terms of the corresponding quadratic risk of $\hat\theta_n$. The
minimum predictive risk can be attained by optimally flattening the normal
density estimate around $\hat\theta_n$. The choice of the optimal flattening coefficient
is not unique. An asymptotically efficient choice based on a reasonable
estimate $U[\hat\theta_n](X_n)$ of the quadratic risk of $\hat\theta_n$ can be made.



Theorem 1.1. For any estimator $\hat\theta_n$ in $\mathcal{A}$ we have

(8) $\left| \rho_0(\theta_n, \hat\theta_n) - \frac{n}{2}\log\left(1 + (nr)^{-1} q(\theta_n, \hat\theta_n)\right) \right| \le O(1)$ as $n \to \infty$.

And if $\hat c(X_n) = 1 + (nr)^{-1} U[\hat\theta](X_n)$ is based on a suitable estimate $U_n$ of
the quadratic risk as defined in Equations (5)-(7) then

(9) $\rho(\theta_n, g[\hat\theta_n, \hat c(n)]) - \rho_0(\theta_n, \hat\theta_n) \le O(1).$

We represent the optimal density estimate $g[\hat\theta_n, \hat c(n)]$ by $g[\hat\theta_n]$. Based on
the asymptotic relations between $\rho_0(\theta_n, \cdot)$ and the Mean Square Error (MSE)
$q(\theta_n, \cdot)$, we can characterize the predictive risk of $g[\hat\theta_n]$ easily by
plugging in standard oracle inequalities from point estimation theory. We
check that the RASL conditions defined in Equations (3)-(7) hold for the
James-Stein estimator (Stein, 1981)

$\hat\theta_n^{JS} = X_n\left(1 - \frac{n-2}{\|X_n\|^2}\right)$

and its positive part estimator $\hat\theta^{JS+}$. For the James-Stein estimator we
determine the deviations from the optimal risk in terms of dimension
dependent bounds.

Theorem 1.2. For any dimension $n > 10$ and for any $\theta_n \in \mathbb{R}^n$ we have,

where the constants an,bn,ln are independent of the parameter but depend 
on the dimensions n and are given by 

an = 3(1 - (n - 2)-l)"^ 6„ = 4(2 + a„ + k2{n)), 
In = 3(1 — 2/n)~^, k2{n) = max{e(n), /(n)} with 



en = V3 J](l - (2i + l)/n)}-i/2 and /n = (1 - (logn/n)'/^)-^ 
1=1 

Also, piOn, g[On^\) can be approximated by using the following bound 

p{On,g[eil']) 



^.JS^\ nloglFeJ 



^ {aj^ bl!^ r-^l^ + {an + bn) r"^ + r-^) 



These bounds hold for any value of $r \in (0, \infty)$. As the ratio of the future to
past variances r decreases, we need to estimate the future observations based
on increasingly noisy past observations and so, the difficulty of the density
estimation problem also increases. So, as expected, when r decreases the
bounds also increase. These bounds can be made dimension independent.
In particular, for all dimensions $n > 20$, for any $\theta_n \in \mathbb{R}^n$ and for any
fixed value of $r \in (0, \infty)$, we have,
(10) p{9n^g[0^'^]) - p{0n,Bf) < 5.3r-3/2 ^ ^^ ^^-2 ^ ^ j^-3_ 

In point estimation theory, there exist sharp oracle bounds on the quadratic
risk $q(\theta_n, \hat\theta_n^{JS})$ of the James-Stein estimator $\hat\theta^{JS}$ (Johnstone, 2012,
Chapter 2) which along with Theorem 1.2 produce the following oracle bound on
the predictive risk of shrinkage predictive density estimates. Assuming that
the value $\|\theta_n\|^2$ is known, the risk of the ideal linear predictive density
estimate is given by

(11) $IL(\theta_n) = \frac{n}{2}\log\left(1 + r^{-1}\frac{\alpha_n}{1 + \alpha_n}\right)$ where $\alpha_n = \|\theta_n\|^2/n$.

The difference in the risk of $g[\hat\theta_n^{JS}]$ and the optimal oracle linear risk is

(12) p{en,g[e;{^]) - ILiOn) <0.lr^^ + 5.3r^^/^ + 18.lr~^ + 1.7r"\ 

Comparing this to the oracle bound of Xu and Zhou (2011) which is derived 
based on an empirical Bayes perspective 

(13) p(e„,(7[^„^^]) -IL(0„) <2r-i+5r-2+4r-3, 

the particular features of our moment based approach can be seen. As our
oracle inequality is a by-product of the optimal Gaussian risk, for most
values of r the bound in Inequality (12) is coarser than that in
Inequality (13). However, when r = 0.1, the RHS in Inequality (12) is 3830
and is better than the bound (4520) in the latter. Thus, the moment based
approach can be quite informative. The bounds derived on the predictive risk
are sharp enough to derive decision-theoretic optimality. We can produce
unrestricted improved minimax predictive densities which asymptotically
behave like ideally shrunk linear density estimates (as defined later). The
following lemma shows the asymptotic improvement in the predictive risk
over the best invariant predictive density $g[X_n, 1 + r]$ (Liang and Barron,
2004).

Lemma 1.1. If $\|\theta_n\|^2 \to \infty$ as $n \to \infty$, then we have

[a] $\rho(\theta_n, g[X_n, 1 + r]) = 2^{-1} n \log(1 + r^{-1})$.

[b] $\rho(\theta_n, g[\hat\theta_n^{JS}, r]) \sim (2r)^{-1} n\, \alpha_n (1 + \alpha_n)^{-1}$ where $\alpha_n = n^{-1}\|\theta_n\|^2$.

[c] $\rho(\theta_n, g[\hat\theta_n^{JS}, 1 + r]) \sim 2^{-1} n \{\log(1 + r^{-1}) - (1 + \alpha_n)^{-1}(1 + r)^{-1}\}$.

[d] $\rho(\theta_n, g[\hat\theta_n^{JS}]) \sim 2^{-1} n \log\{1 + r^{-1}\alpha_n(1 + \alpha_n)^{-1}\}$.

The improvement in predictive risk due to efficient choice of location is
reflected by the risk of $g[\hat\theta^{JS}, 1 + r]$ whereas the effect of the optimal
choice of scale after choosing an appropriate location estimate can be
followed by evaluating the asymptotic predictive risk of $g[\hat\theta^{JS}]$.
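
To see how the four asymptotic expressions in Lemma 1.1 compare numerically, the sketch below evaluates them for given n, r and $\alpha_n$. It is an illustration only: the function name `lemma_1_1_risks` and the dictionary keys are assumptions of this sketch, and the formulas are the leading-order terms of [a] to [d], not exact finite-sample risks.

```python
import math


def lemma_1_1_risks(n, r, alpha):
    """Leading-order predictive risks [a]-[d] of Lemma 1.1.

    n     : dimension
    r     : ratio of future to past variance
    alpha : ||theta_n||^2 / n
    """
    shrink = alpha / (1.0 + alpha)  # asymptotic James-Stein quadratic risk per coordinate
    return {
        "a_best_invariant": 0.5 * n * math.log(1.0 + 1.0 / r),
        "b_js_plug_in": 0.5 * n * shrink / r,
        "c_js_location_only": 0.5 * n * (
            math.log(1.0 + 1.0 / r) - 1.0 / ((1.0 + alpha) * (1.0 + r))
        ),
        "d_js_flattened": 0.5 * n * math.log(1.0 + shrink / r),
    }
```

For example, `lemma_1_1_risks(100, 1.0, 1.0)` orders the estimates as [d] below [c] below [a], with [d] also below the plug-in value [b], which is the pattern the lemma describes.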
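
The regularity conditions that we impose on the location point estimates
do not extend to convex collections of estimates in $\mathcal{A}$. But, the predictive
risk still concentrates and the optimal predictive risk $\rho_0$ can be determined.

Lemma 1.2. For any countable collection $\Lambda$ of estimators $\hat\theta_n[\lambda]$ in $\mathcal{A}$ and
their convex collection $\hat\theta_n^c = \sum_{\lambda \in \Lambda} w_\lambda \hat\theta_n[\lambda]$ with
$\sum_{\lambda \in \Lambda} w_\lambda = 1$, we have

$\rho_0(\theta_n, \hat\theta_n^c) - \frac{n}{2}\sum_{\lambda \in \Lambda} w_\lambda \log\left(1 + (nr)^{-1} q(\theta_n, \hat\theta_n[\lambda])\right) \le O(1)$ as $n \to \infty$.

And, the predictive density estimate $\sum_{\lambda \in \Lambda} w_\lambda\, g[\hat\theta_n[\lambda]]$ is
asymptotically optimal in the sense

(14) $\rho\left(\theta_n, \sum_{\lambda \in \Lambda} w_\lambda\, g[\hat\theta_n[\lambda]]\right) - \rho_0(\theta_n, \hat\theta_n^c) \le O(1)$ as $n \to \infty$.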

The class $\mathcal{G}[p]$ of product Gaussian density estimates is minimax optimal
over ellipsoids (Xu and Liang, 2010). However, we show that the class $\mathcal{G}[p]$
is minimax sub-optimal over the $\ell_0$-sparsity constrained space $\Theta(n, s)$ when
$s/n \to 0$ as $n \to \infty$.

Theorem 1.3. For any fixed $r \in (0, \infty)$, as $n \to \infty$, for every sequence
$s_n$ with $s_n/n \to 0$, we have

(15) $\min_{\hat p \in \mathcal{G}_n[p]} \max_{\theta_n \in \Theta(n,s)} \rho(\theta_n, \hat p) = r^{-1} s \log(n/s)(1 + o(1)).$

By Theorem 1.2 in Mukherjee and Johnstone (2012), we know that the
asymptotic minimax risk $R(n, s, r)$ over $\Theta(n, s)$ is given by

(16) $R(n, s, r) \sim (1 + r)^{-1} s \log(n/s)$ as $n \to \infty$, $s \to \infty$ and $s/n \to 0$.

Hence, the minimax sub-optimality of the class $\mathcal{G}[p]$ over the $\ell_0$ sparse space
is $1 + r^{-1}$. The parametric space $\Theta(n, s)$ is not invariant to the group of
orthogonal transformations. If the parameter space does not have any sparser
representation with respect to the group of orthogonal transformations, then
the asymptotic sub-optimality of the class $\mathcal{G}$ is also $1 + r^{-1}$.

1.3. Organization of the paper. The predictive error of the class $\mathcal{G}[1]$ of
predictive density estimates is presented in the next section, which discusses
the role of shrinkage in high-dimensional prediction problems. Predictive
error and restricted minimax risk of the class $\mathcal{G}$ are presented in Section 3.

2. Role of shrinkage and optimal error in $\mathcal{G}[1]$.

Hereon we will assume that $\sigma_p^2 = 1$ and $\sigma_f^2 = r$. The general predictive KL
risk will not be affected by this restriction. However, the density estimates
are usually based on statistics equivariant to the scale transformation and
need multiplication by $\sigma_p$.

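Heuristic Idea. In high dimensions the quadratic loss of a reasonable
point estimator will concentrate around its risk. And so, the KL risk of the
corresponding Gaussian predictive density partitions into two parts
involving (i) quadratic risk on the location parameter adjusted by the expected
scale (ii) logarithm of the expected scale. As such, the risk
$\rho(\theta_n, g[\hat\theta_n, \hat c_n])$ of the normal predictive density estimate
$g[\hat\theta_n, \hat c_n]$ is given by

$\frac{n}{2}\left\{ E_{\theta_n}(\log \hat c(X_n)) + E_{\theta_n}\left(\frac{1}{\hat c(X_n)} - 1\right) + \frac{1}{nr} E_{\theta_n}\left(\frac{\|\hat\theta(X_n) - \theta_n\|^2}{\hat c(X_n)}\right) \right\}.$

In high dimensions, due to concentration of measure we expect
• $E_{\theta_n}\log(\hat c_n) \sim \log E_{\theta_n}\hat c_n$
• $E_{\theta_n}\left(\|\hat\theta(X_n) - \theta_n\|^2 \cdot \hat c^{-1}(X_n)\right) \sim (E_{\theta_n}\hat c_n)^{-1} q(\theta_n, \hat\theta_n)$

which will lead to

$\rho(\theta_n, g[\hat\theta_n, \hat c_n]) \sim \frac{n}{2}\left\{ \log E_{\theta_n}\hat c_n + \frac{1 + (nr)^{-1} q(\theta_n, \hat\theta_n)}{E_{\theta_n}\hat c_n} - 1 \right\} + O(1)$ as $n \to \infty$.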

This asymptotic decomposition of the predictive risk can be explicitly
validated through the RASL properties. Because of this decomposition, for any
fixed point estimate $\hat\theta_n$ at each parametric value $\theta_n$ we can minimize the
above asymptotic value of $\rho(\theta_n, g[\hat\theta_n, \cdot])$ over the scalar quantity
$E_{\theta_n}\hat c_n$. The minimum asymptotic value is given by

$\rho(\theta_n, g[\hat\theta_n]) \sim n/2 \cdot \log\left(1 + (nr)^{-1} q(\theta_n, \hat\theta_n)\right)$

and the optimal value is attained when

$E_{\theta_n}(\hat c^{opt}(X_n)) = 1 + (nr)^{-1} q(\theta_n, \hat\theta_n) = IF_{\theta_n}(\hat\theta_n)$

which is the ideal flattening coefficient. However, $IF_{\theta_n}(\hat\theta_n)$ is unknown.
But, it depends only on the parametric value $\theta_n$. Thus a choice would be
$\hat c^{opt}(X_n) = 1 + (nr)^{-1} U[\hat\theta_n](X_n)$, where $U[\hat\theta_n](X_n)$ (to be abbreviated
as $U_n$) is a reasonable (i.e., with reasonable bias and concentration
properties) estimate of the quadratic risk of $\hat\theta_n$. With very high
probability such an optimal choice of $\hat c_n$ will be greater than 1, reflecting
a flattening of scale of the estimated predictive density (with respect to the
true future variability). Intuitively, we are performing an appropriate
flattening of the density based on empirical estimates of the quadratic loss.
The optimal density $g[\hat\theta_n]$ levels out with increasing inaccuracy in the
location estimate $\hat\theta_n$.
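
The minimization step in the heuristic above can be checked numerically. The sketch below treats the asymptotic risk as a function of a deterministic scale c, namely (n/2){log c + (1 + q/(nr))/c - 1}, and confirms that it is minimized at the ideal flattening coefficient with minimum value (n/2) log(1 + q/(nr)). The function names `asymptotic_risk` and `ideal_flattening` are illustrative assumptions, and the random scale estimate is replaced by its expectation.

```python
import math


def ideal_flattening(n, r, q):
    """Ideal flattening coefficient 1 + q / (n r) for quadratic risk q."""
    return 1.0 + q / (n * r)


def asymptotic_risk(c, n, r, q):
    """Asymptotic predictive risk when the scale concentrates at c."""
    return 0.5 * n * (math.log(c) + ideal_flattening(n, r, q) / c - 1.0)
```

With n = 100, r = 0.5 and q = 60 the ideal coefficient is 2.2, and evaluating `asymptotic_risk` at 2.2 returns 50 log(2.2), smaller than its value at c = 1 (the plug-in scale) or at any nearby c.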

One of the popular frequentist notions (which is better than plug-in
density estimates) of constructing predictive densities in this parametric
model is to use a Gaussian density estimate around an efficient location
$\hat\theta_n$ with variance $r + \mathrm{Var}(\hat\theta_n)$. Estimates of this kind are natural
extensions of confidence sets. The optimal density estimate $g[\hat\theta]$ is quite
similar except with a larger variance $r + q(\theta_n, \hat\theta_n)$. Unless the bias of
$\hat\theta_n$ is negligible compared to its variance, the above mentioned general
notion produces sub-optimal density estimates. Next, through the RASL
conditions we will quantify some statistical regularities in the behavior of
quadratic loss in high dimensions.

2.1. RASL Properties of a Location Point Estimate. In high dimensions,
for any fixed location parameter $\theta_n$ and its estimate $\hat\theta_n$ we expect the
quadratic loss $\|\hat\theta_n - \theta_n\|^2$ to be concentrated around its expected value
(quadratic risk) $E_{\theta_n}\|\hat\theta_n - \theta_n\|^2$, and this would be reflected by its
variance. We will also rule out very bad point estimators by neglecting those
with too high risk as we do not want them for prediction purposes. Apart from
these we also assume the existence of a statistic which estimates the
quadratic risk within reasonable bias. These properties of point estimators
are referred to as Reasonable Asymptotic Square Loss properties and the
corresponding location estimates as RASL estimates.

As dimension $n \to \infty$, for any fixed parametric value $\theta_n$ the location point
estimate $\hat\theta(X_n)$ is such that its quadratic loss has the following properties.

P1. Reasonable Risk:

$E_{\theta_n}\|\hat\theta_n - \theta_n\|^2 \le O(n).$

The canonical minimax point estimator $X_n$, which is also the UMVUE
(under square loss) in this case, acts as the benchmark in weeding out
the bad point estimators. For any parameter value $\theta_n$, $X_n$ has constant
risk n. So, it is appropriate for our purpose to restrict ourselves
to point estimators with risk of the order O(n).

P2. Concentration property of Quadratic loss:

$\mathrm{Var}_{\theta_n}(\|\hat\theta_n - \theta_n\|^2) \le O(n).$

In high dimensions the estimator $\hat\theta_n$ is such that its loss has variability
less than O(n). Again comparing with $X_n$, we see
$\mathrm{Var}_{\theta_n}(\|X_n - \theta_n\|^2) = 2n$ as $\|X_n - \theta_n\|^2$ is distributed as a central
chi-square random variable with n degrees of freedom.

P2 implies concentration of the loss function and would in turn also
impose some concentration properties on well-behaved functions of the
loss. As such, using Lemma A.1, we have

P2.a  $\mathrm{Var}_{\theta_n}\left\{\left(1 + \frac{\|\hat\theta_n - \theta_n\|^2}{nr}\right)^{-1}\right\} \le O(n^{-1})$

following directly from P2. It is an important condition and will be
used in our derivations.



P3. Reasonable Estimate of Quadratic Risk:

There exists an estimator $U[\hat\theta_n](X_n)$ (abbreviated as $U_n$) of
the quadratic risk of $\hat\theta_n$ satisfying the following:

P3.1.  $\left| E_{\theta_n}(U_n) - E_{\theta_n}\|\hat\theta_n - \theta_n\|^2 \right| \le O(n^{1/2}).$

P3.2.  $\mathrm{Var}_{\theta_n}(U_n) \le O(n).$

P3.3.  $\mathrm{Var}_{\theta_n}\left\{\left(1 + \frac{U_n}{nr}\right)^{-1}\right\} \le O(n^{-1}).$

P3.1 implies existence of a statistic which estimates the quadratic risk
without significant bias. Bias exceeding $O(\sqrt{n})$ is considered
significant here and the order is associated with the $O(n^{-1})$ asymptotic
statements we would like to make. P3.1 and P3.2 are analogous to
P2 and P2.a respectively. They imply that the asymptotic concentration
properties associated with the quadratic loss also hold for its
estimator $U_n$. If $U_n$ is positive then P3.2 follows directly from P3.1
by Lemma A.1.

2.2. Validating the RASL properties. Given a location point estimator
and its corresponding reasonable quadratic risk estimate the RASL conditions
can be checked at least by simulations. However, existence of a 'reasonable'
risk estimator (as defined in P3) is essential. For most widely used
point estimates, we can construct risk estimates satisfying the three
conditions in P3 though the procedures can sometimes get quite complicated.
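
A simulation check of P1 and P2 could look like the following minimal sketch. It assumes the orthogonal model with unit past variance; the helper names `simulate_losses`, `rasl_summary` and `canonical`, the number of replications and the seed are choices made for this illustration only.

```python
import random
import statistics


def simulate_losses(estimator, theta, reps=2000, seed=0):
    """Draw X ~ N(theta, I) `reps` times and return the quadratic losses."""
    rng = random.Random(seed)
    losses = []
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        estimate = estimator(x)
        losses.append(sum((e - t) ** 2 for e, t in zip(estimate, theta)))
    return losses


def rasl_summary(estimator, theta, reps=2000, seed=0):
    """Empirical risk / n and Var(loss) / n; P1 and P2 ask both to stay bounded."""
    n = len(theta)
    losses = simulate_losses(estimator, theta, reps, seed)
    return {
        "risk_per_dim": statistics.mean(losses) / n,
        "loss_var_per_dim": statistics.variance(losses) / n,
    }


def canonical(x):
    """The canonical estimator X_n, used here as the benchmark."""
    return list(x)
```

For the canonical estimator the two summaries should be close to 1 and 2, matching risk n and loss variance 2n; repeating the call for growing n and for other estimators gives an empirical view of whether the O(n) bounds are plausible.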

If $\hat\theta_n$ is the posterior mean, i.e., the generalized Bayes estimate with
respect to a prior $\pi$, then by Tweedie's formula (Brown, 1971, Robbins, 1956)
we have an explicit expression for an unbiased estimate of its risk:

$U_n^\pi = n - \left\{ \|\nabla \log m_\pi(X_n)\|^2 - 2\,\frac{\nabla^2 m_\pi(X_n)}{m_\pi(X_n)} \right\}$

where $\nabla f = (D_i f)_{i=1}^n$, $\nabla^2 m_\pi(X_n) = \sum_{i=1}^n D_i^2 m_\pi(X_n)$ and
$m_\pi(X_n) = \int \phi_n(X_n \mid \theta_n, I)\,\pi(\theta_n)\, d\theta_n$.

$U_n^\pi$ is a natural candidate for a 'reasonable estimate of the quadratic loss'
though P3.2 and P3.3 are also to be checked separately. In particular, for
P3.3 to hold $U_n^\pi$ may need some modification by introducing some bias.

For spherically symmetric estimators, we can get candidates for 'reasonable'
risk estimates by using Stein's unbiased (quadratic) risk estimates
(SURE) or their modifications (like positive part, etc.) (Stein, 1974, 1981).
As mentioned before, here too we needed to introduce some bias to the
SURE estimate as the unbiased one does not have property P3.3.

These RASL conditions are quite mild and usually hold for reasonable
point estimates and can be checked by Monte Carlo simulations for
arbitrary point estimates. Next, we check these conditions analytically for
the following popular point estimators:

$\hat\theta^{JS}$ : James-Stein estimator

$\hat\theta^{JS+}$ : Positive part James-Stein estimator

$\hat\theta^{H}$ : Posterior mean of harmonic prior $\pi_H(\theta_n) \propto \|\theta_n\|^{-(n-2)}$.

All these 3 point estimators are linear estimates of the form $s(X_n)X_n$ where
$s(X_n)$ is a data-dependent shrinkage term. They are better than the
canonical minimax estimator $X_n$. While $\hat\theta^{H}$ is admissible, $\hat\theta^{JS}$ and
$\hat\theta^{JS+}$ are both inadmissible. As such both $\hat\theta^{JS+}$ and $\hat\theta^{H}$ dominate
$\hat\theta^{JS}$. However, in high dimensions, they behave similarly and have near
ideal linear risk properties. We will construct reasonable risk estimates for
each of these estimators. While verifying the RASL conditions for the JS
estimator we would also compute the bound explicitly for each n. It will be
needed afterwards in Theorem 1.2. Since the estimators are spherically
symmetric it will be more informative to derive bounds depending on
$\|\theta_n\|$. Henceforth in this section, by $\alpha_n$ we denote $n^{-1}\|\theta_n\|^2$. A
convenient fact about these spherically symmetric estimators is that the
n-dimensional parameter $\theta_n$ can be substituted by $(\|\theta_n\|, 0, \ldots, 0)$ while
checking the asymptotic behavior of square loss. As these estimators are not
Lipschitz functions of the normal random variable X, we cannot directly use
well-established Gaussian concentration inequalities (Dembo and Zeitouni,
1993, Ledoux, 2001).

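2.2.1. James-Stein estimator. The James-Stein estimator and its unbiased
risk estimate are given by:

$\hat\theta_n^{JS} = X_n\left(1 - \frac{n-2}{\|X_n\|^2}\right)$, and $U(\hat\theta_n^{JS}) = n - \frac{(n-2)^2}{\|X_n\|^2}$.

RASL property P1 holds as the JS estimator is better than the canonical
estimator $X_n$. As such a good upper bound on its risk is also known:

$E_{\theta_n}\|\hat\theta_n^{JS} - \theta_n\|^2 \le 2 + \frac{n(1 - 2/n)\alpha_n}{(1 - 2/n) + \alpha_n}.$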

Lemma 2.1. 

-^n\^ <4 2n + (n-2)^n-2A:i(n) + n/c2(n) 

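Proof. We decompose $\|\hat\theta_n^{JS} - \theta_n\|^2$ into 3 parts as

$\|\hat\theta_n^{JS} - \theta_n\|^2 = \|X_n - \theta_n\|^2 + (n-2)^2\|X_n\|^{-2} - 2(n-2)M_n$

where $M_n = \langle X_n - \theta_n, X_n\|X_n\|^{-2} \rangle$. Then we use the naive inequality that
for any three random variables $Z_i$, $i = 1, 2, 3$,

$\mathrm{Var}\left(\sum_{i=1}^3 Z_i\right) \le \sum_{j=0}^3 \mathrm{Var}\left(\sum_{i=1}^3 (-1)^{I\{i=j\}} Z_i\right) = 4\sum_{i=1}^3 \mathrm{Var}(Z_i)$

to get the following bound on $\mathrm{Var}_{\theta_n}(\|\hat\theta_n^{JS} - \theta_n\|^2)$:

$\le 4\left\{ \mathrm{Var}_{\theta_n}(\|X_n - \theta_n\|^2) + \mathrm{Var}_{\theta_n}\left(\frac{(n-2)^2}{\|X_n\|^2}\right) + 4(n-2)^2\,\mathrm{Var}_{\theta_n}(M_n) \right\}.$

Now $\|X_n - \theta_n\|^2$ has a central chi-square distribution with n degrees of
freedom and hence its variance is 2n. The bounds on the other quantities
follow from Lemma 2.2 and Lemma A.1. □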

Lemma 2.2. For n > 10 we have VarQ^^{Mn) < n ^k2{n) where

V3 



k2{n) = max{/i(n), A;(n)} where en 



nti(i-(2^ + i)/^)}^/' 

""""^ " (l-(logn/n)V2)2- 

Proof. The variance of $M_n$ is the same as the variance of
$\langle \theta_n, X_n \rangle\|X_n\|^{-2}$ whose distribution is spherically symmetric in $\theta_n$ as
it can be written as a sum of two spherically symmetric terms
$H_n = \langle \theta_n, X_n - \theta_n \rangle\|X_n\|^{-2}$ and $J_n = \|\theta_n\|^2\|X_n\|^{-2}$. So, without loss
of generality we can assume that $\theta_n = (\theta, 0, \ldots, 0)$ where $\theta = \|\theta_n\|$. We
also divide the proof into two cases depending on the magnitude of $\theta$.

When 6* < V" we have, Var^^ (M„) < 2{Yarg^{Hn)+Yare^{Jn)) with the 
later being less than n^^ by Lemma A. 3. And, the former is bounded above 

by E{H^). Now, with Z = N{0, 1) and W = xLi(O) and V = {Z + e^ + W 
it can be rewritten as 

>. 1/2 



nti(^-2i-i) 



which is less than n'^ V?,W^i^^{l - {2i + l)/?i)~^/2^ 

When 6 > n, we first recall that M„ = {9 + Z)/W and so 

E{M^) < E{V-\e+z\<i}] +E{{0 + zr%g^z\>i}} 



/oo 
; 



X ^(t>(x — 9) dx 



< [{n - 3)(n - 5)]-^ + $(\/k^) + {9- ^ 

< [(n- 3)(n - 5)]"^ +n-^(logn)-^ +n-^(l - (log n/n)i/2)-2 

Hence the result follows. □ 

Though it is very tempting, we cannot use the unbiased risk estimate
$U(\hat\theta_n^{JS})$ as the estimate can be negative and violates P3.3.

Lemma 2.3. For any fixed n and r, $E_{\theta}[\{1 + (nr)^{-1} U(\hat\theta_n^{JS})\}^{-1}]$ does
not exist.

We will instead use the positive part $U_+$ of $U(\hat\theta_n^{JS})$ and the scale estimate
$\hat c^{JS+} = 1 + (nr)^{-1} U_+$. RASL condition P3.1 can be easily checked as

$\mathrm{Var}_{\theta_n}(U_+) \le \mathrm{Var}_{\theta_n}(U(\hat\theta_n^{JS})) = (n-2)^4\,\mathrm{Var}_{\theta_n}(\|X_n\|^{-2}) = O(n)$

by Lemma A.1 and P3.2 follows from Lemma A.3. As such, an exact
dimension dependent bound can also be derived.
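
The estimator, its unbiased risk estimate and the positive-part scale estimate described above can be sketched as follows. This is an illustrative implementation under the unit past variance convention of this section; the function names and the Monte Carlo settings are assumptions of the sketch, not the authors' code.

```python
import random


def squared_norm(x):
    return sum(v * v for v in x)


def james_stein(x):
    """James-Stein estimate X (1 - (n - 2) / ||X||^2)."""
    n = len(x)
    shrink = 1.0 - (n - 2) / squared_norm(x)
    return [shrink * v for v in x]


def sure_james_stein(x):
    """Unbiased risk estimate U = n - (n - 2)^2 / ||X||^2 (can be negative)."""
    n = len(x)
    return n - (n - 2) ** 2 / squared_norm(x)


def flattening_coefficient(x, r):
    """Scale estimate 1 + (n r)^{-1} U_+, using the positive part of U."""
    n = len(x)
    return 1.0 + max(sure_james_stein(x), 0.0) / (n * r)


def average_loss_and_sure(theta, reps=2000, seed=0):
    """Monte Carlo averages of the JS quadratic loss and of its risk estimate."""
    rng = random.Random(seed)
    total_loss = 0.0
    total_sure = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        estimate = james_stein(x)
        total_loss += sum((e - t) ** 2 for e, t in zip(estimate, theta))
        total_sure += sure_james_stein(x)
    return total_loss / reps, total_sure / reps
```

Taking the positive part keeps the scale estimate at or above 1, so the inverse appearing in P3.3 stays bounded, while the two Monte Carlo averages returned by `average_loss_and_sure` should agree up to simulation error because U is unbiased for the risk.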

Lemma 2.4. For any fixed n > 3 we have 



\Biasg^{c^^~'~)\ < k^{n)n ^1'^ where k^i^n) 



V2 + 5n-V2 
1-2/n 

Proof. Noting that Biase„ (c;^^+ ) = ri"^/^ E^^ (C/^) and 



n\Ee^{U-)\<Ke„ 



n 



--l].l{Y<n} 



where Y follows Chi-square with degree n and non-centrality parameter 

ll&niP- We know that Y = Xn+2N where = Poisson(||6'„|p/2) and the 
above expectation can be written as 



E, 



E 



n 



n+2N 



1 • I{Yn+2N < n} 



N 



) < E 



n 



1 • l{Yn < n} 



where Yn+2N is a central chi-square random variable with (n -|- 2A^) degrees 
of freedom and the second inequality follows as for any A'^ > 0, (n/l^+2Af — 
1) • I{Yn+2N < n} is stochastically dominated by A" = 0. Now, 



E 



n 
Y 



1 • I{y < n}N 



" n - y ^"/^-le-?'/^ E\Wn - n\ 

dy < 







y 2"/2r(n/2) 



n-2 



where Wn ~ Gamma(n/2 — 1, 1/2) and so 

E\Wn-n\<l + E{Wn -nf <l + A + Var{Wn) = 5 + V2n 

where the second inequality follows by Bias- Variance decomposition. Thus, 
we get our result. □ 

2.2.2. James-Stein Positive part Estimator. We consider the positive part of the JS estimator and a reasonable estimate of its loss:

$$\hat\theta_n^{JS+} = X_n \Big(1 - \frac{n-2}{\|X_n\|^2}\Big)_+ \quad\text{and}\quad U(\hat\theta_n^{JS+}) = \Big(n - \frac{(n-2)^2}{\|X_n\|^2}\Big)_+ .$$

An unbiased estimator of the quadratic risk of $\hat\theta_n^{JS+}$ exists (Johnstone, 2012, Exercise 2.13). We use a biased estimator here mainly to highlight the
fact that even biased estimators will work.
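The positive-part estimator, its loss estimate and the resulting flattening coefficient can be sketched in a few lines. This is only an illustration under stated assumptions: it takes $X_n \sim N(\theta_n, I_n)$ with $n \ge 3$, reads the garbled display above as $U = (n - (n-2)^2/\|X_n\|^2)_+$ and $c = 1 + (nr)^{-1} U$, and the function names are hypothetical.

```python
def james_stein_plus(x):
    """Positive-part James-Stein estimate for one draw x ~ N(theta, I_n), n >= 3."""
    n = len(x)
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0.0:
        return [0.0] * n  # total shrinkage; a probability-zero event
    shrink = max(0.0, 1.0 - (n - 2) / sq_norm)
    return [shrink * v for v in x]


def loss_estimate_plus(x):
    """Biased loss estimate U = (n - (n - 2)^2 / ||x||^2)_+ (assumed form)."""
    n = len(x)
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0.0:
        return 0.0
    return max(0.0, n - (n - 2) ** 2 / sq_norm)


def scale_estimate(x, r):
    """Flattening coefficient c = 1 + U / (n r); it is never below 1."""
    return 1.0 + loss_estimate_plus(x) / (len(x) * r)
```

Because the loss estimate is truncated at zero, the scale estimate stays at or above 1, which is the kind of positivity that P3.3 asks for.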
P1 follows from the fact that $\hat\theta^{JS+}$ is better than $\hat\theta^{JS}$ (Johnstone, 2012, Exercise 2.8).

For checking P2, define $C_n$ to be the event $\{X_n : \hat\theta^{JS+}(X_n) \ne 0\} = \{X_n : \hat\theta_n^{JS+} = \hat\theta_n^{JS}\}$. The idea is to relate the variance of the loss in the JS+ case to that of the JS estimator.



= E{\\e;!' - 9n\\^Ic„} - E^lll^i^ - 9^flc„} + \\9n\\'P{C^n) " 1 1 | (C^,) 

= Var,„(||^„^^ - 9J'\ Cn) • PeACn) + \\9n\\'Pe„{Cn)Pe„{C'n) 
<^areM^n' -Onf) + \\9n\\'PeAC'n) 
as Vare„(||^„^^ - ) > E,„ (Var,„ - ^.f |C„)) 

We know that $\mathrm{Var}_{\theta_n}(\|\hat\theta_n^{JS} - \theta_n\|^2)$ is $O(n)$, and Lemma A.2 shows that $\|\theta_n\|^4 P_{\theta_n}(C_n^c) \le O(n)$. So, we have the desired bound.

Condition P. 3.1. We will condition on the event C„ again and express 
P.3.1 in terms of the James-Stein estimator 



where $Y_n$ is a central chi-squared random variable with n degrees of freedom. Now, we decompose $I_n$ into



Ee^U{9;i'+)-q{9nX'^)=Ee., 




On\?P{Cl). 



When 9, 



then the R.H.S for large n reduces to, 




In = ll + 2ll where 



ll = E{{n-Yn)IcJ ^ndll = E{{ 




We standardize Yn as Zn = {Y^ — n)/ \j2ji. We c 
equalities on and have, [need to make rigorous] 




as on Cn, Z„ > 0. Thus 



/„ < O(V^). 
Condition P3.2. By Lemma A.4 we have

$$\mathrm{Var}_{\theta_n}\big(U(\hat\theta_n^{JS+})\big) \le \mathrm{Var}_{\theta_n}\big(U(\hat\theta_n^{JS})\big) = O(n).$$

Condition P3.3. Follows from Lemma A.1.

Harmonic Prior. The conditions can be checked for $\hat\theta^{H}$ by using its closed form expressions in Xu (2007, Chapter 2).

2.3. Determining $\rho_{\mathcal G}$ for RASL point estimators. In this section, we show that in high dimensions we can express, with very high precision, $\rho_{\mathcal G}(\theta_n, \hat\theta_n)$, the minimum predictive entropy risk of the class of Gaussian density estimates around location $\hat\theta_n$, in terms of the mean square estimation error of $\theta_n$ by $\hat\theta_n$. We first prove bounds on the error rates which hold for all dimensions but are dimension dependent. Then we show that in high dimensions those bounds are asymptotically sharp.

2.3.1. Lower Bound on $\rho_{\mathcal G}(\theta_n, \hat\theta_n)$. Next, we produce a lower bound on the prediction error. The bound is ultimately a function of $\theta_n$, though it depends on the form of $\hat\theta_n$. It involves the expectation of a quantity which is usually neither a parameter nor a statistic and hence cannot be computed in closed form.

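Lemma 2.5. For any dimension n, any parameter value $\theta_n$ and any location point estimate $\hat\theta(X_n)$, we have

$$\rho_{\mathcal G}(\theta_n, \hat\theta_n) \ge \frac{n}{2}\, E_{\theta_n}\big\{\log\big(1 + (nr)^{-1} \|\hat\theta_n - \theta_n\|^2\big)\big\}.$$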

Proof. For any fixed n, the risk of the predictive density $g_n$, which is an n-dimensional product normal with data adaptive mean $\hat\theta(X_n)$ and data and parameter dependent variance $c(\theta_n, X_n)\, r$ (for all coordinates), is given by

$$2\rho(\theta_n, g_n) = n\, E_{\theta_n}\Big\{\log c(\theta_n, X_n) + \frac{1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2}{c(\theta_n, X_n)} - 1\Big\}.$$

For any fixed value of $\theta_n$ and for each $X_n$,

$$\log c(\theta_n, X_n) + c^{-1}(\theta_n, X_n)\big\{1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2\big\} - 1$$

is minimized at $c^{opt}(\theta_n, X_n) = 1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2$ and the minimum value is given by $\log(1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2)$. Hence, the result follows.

□
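The pointwise minimization in the proof can be checked numerically. The sketch below assumes the loss is the Kullback-Leibler divergence from the true future density $N(\theta_n, rI)$ to the estimate $N(\hat\theta_n, c\,rI)$, which gives the bracketed expression above; the function names are hypothetical.

```python
import math


def kl_loss(theta, theta_hat, c, r):
    """KL divergence from N(theta, r I) to N(theta_hat, c r I)."""
    n = len(theta)
    a = sum((t - h) ** 2 for t, h in zip(theta, theta_hat)) / (n * r)
    return 0.5 * n * (math.log(c) + (1.0 + a) / c - 1.0)


def ideal_flattening(theta, theta_hat, r):
    """Minimizer over c of kl_loss: c_opt = 1 + ||theta_hat - theta||^2 / (n r)."""
    n = len(theta)
    return 1.0 + sum((t - h) ** 2 for t, h in zip(theta, theta_hat)) / (n * r)


theta = [1.0, 2.0, 3.0]
theta_hat = [0.0, 2.0, 5.0]
r = 0.5
c_opt = ideal_flattening(theta, theta_hat, r)
min_loss = kl_loss(theta, theta_hat, c_opt, r)  # equals (n / 2) * log(c_opt)
```

At `c_opt` the loss reduces to $(n/2)\log c^{opt}$, and moving `c` in either direction increases it. The ideal coefficient uses the unknown `theta`, so it is a benchmark rather than a usable estimate.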

Though $c^{opt}(\theta_n, X_n)$ is the best possible flattening coefficient, it depends on the parameter and cannot be used in practice. As such, $c^{opt}$ is the ideal flattening coefficient. In high dimensions, due to statistical regularity, we expect $c^{opt}(\theta_n, X_n)$ to be very close to its expected value

$$E_{\theta_n}\{c^{opt}(\theta_n, X_n)\} = 1 + (nr)^{-1} E_{\theta_n}\|\hat\theta_n - \theta_n\|^2$$

which can be viewed as the (near) ideal flattening coefficient and is referred to as $IF_{\theta_n}(\hat\theta_n) = 1 + n^{-1} r^{-1} q(\theta_n, \hat\theta_n)$. Here flattening coefficients are usually called scales, and it should be noted that the corresponding variance needs to be multiplied by r.

From Lemma 2.5 we can derive a worse but more tractable bound

(17) $\rho_{\mathcal G}(\theta_n, \hat\theta_n) \ge \frac{n}{2}\, E_{\theta_n}\big\{\log\big(\|\hat\theta_n - \theta_n\|^2/(nr)\big)\big\}.$

2.3.2. Upper Bound for $\rho_{\mathcal G}(\theta_n, \hat\theta_n)$. We now produce an upper bound on the risk of any Gaussian density estimate. Henceforth, SD means standard deviation, and by the bias of the scale estimate $c_n$ we mean its expected deviation from the near ideal flattening coefficient $IF_{\theta_n}(\hat\theta_n)$. With scale estimators based on the statistic $U[\hat\theta_n](X_n)$ and of the form $c(X_n) = 1 + (nr)^{-1} U[\hat\theta_n](X_n)$, we have $\mathrm{Bias}_{\theta_n}(c_n) = (nr)^{-1}\big[E_{\theta_n} U[\hat\theta_n](X_n) - q(\theta_n, \hat\theta_n)\big]$.

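Lemma 2.6. For any fixed dimension n, parameter value $\theta_n$, location point estimate $\hat\theta(X_n)$ and any scale estimate $c(X_n) > 0$ almost surely and of the form $c(X_n) = 1 + (nr)^{-1} U[\hat\theta_n](X_n)$, we have

$$\rho(\theta_n, g[\hat\theta_n, c_n]) - \frac{n}{2} \log IF_{\theta_n}(\hat\theta_n) \le \frac{n}{2}\big\{A_{\theta_n}(\hat\theta_n, c_n) + B_{\theta_n}(\hat\theta_n, c_n)\big\}$$

where

$$A_{\theta_n}(\hat\theta_n, c_n) = IF_{\theta_n}(\hat\theta_n)\, \{E_{\theta_n}(c_n)\}^{-1}\, SD_{\theta_n}(c_n)\, SD_{\theta_n}(c_n^{-1}) + r^{-1}\, SD_{\theta_n}\Big(\frac{\|\hat\theta_n - \theta_n\|^2}{n}\Big)\, SD_{\theta_n}(c_n^{-1})$$

and

$$B_{\theta_n}(\hat\theta_n, c_n) = \mathrm{Bias}^2_{\theta_n}(c_n)\, \{IF_{\theta_n}(\hat\theta_n)\}^{-1}\, \{E_{\theta_n}(c_n)\}^{-1}.$$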

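Proof. The risk of the normal predictive density estimate $g[\hat\theta_n, c_n]$ is given by

$$2\rho(\theta_n, g_n) = n\, E_{\theta_n}\Big\{\log c(X_n) + \frac{1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2}{c(X_n)} - 1\Big\}.$$

Now, we replace $E_{\theta_n}(c_n^{-1})$ by $(E_{\theta_n} c_n)^{-1}$ and $E_{\theta_n}(\|\hat\theta_n - \theta_n\|^2 \times c_n^{-1})$ by $E_{\theta_n}\|\hat\theta_n - \theta_n\|^2 / E_{\theta_n} c_n$ in the above expression to get

(18) $$\tilde\rho(\theta_n, g_n) = \frac{n}{2} E_{\theta_n}(\log c(X_n)) + \frac{n}{2}\Big\{\frac{1 + (nr)^{-1} E_{\theta_n}\|\hat\theta(X_n) - \theta_n\|^2}{E_{\theta_n}(c(X_n))} - 1\Big\} = \frac{n}{2} E_{\theta_n}(\log c(X_n)) - \frac{n}{2}\, \frac{\mathrm{Bias}_{\theta_n}(c(X_n))}{E_{\theta_n}(c(X_n))}$$

and the distortion caused thereby, $\rho(\theta_n, g_n) - \tilde\rho(\theta_n, g_n)$, equals

(19) $$\frac{n}{2}\, E_{\theta_n}\Big\{\frac{1 + (nr)^{-1}\|\hat\theta(X_n) - \theta_n\|^2}{c(X_n)}\Big\} - \frac{n}{2}\, \frac{1 + (nr)^{-1} E_{\theta_n}\|\hat\theta(X_n) - \theta_n\|^2}{E_{\theta_n}(c(X_n))}.$$

Next we will show that $(n/2)^{-1} |\rho(\theta_n, g_n) - \tilde\rho(\theta_n, g_n)| \le A_{\theta_n}(\hat\theta_n, c_n)$. Before that, note that if $U_n$ is unbiased then the second term in Equation (18) vanishes and we have the result stated in Corollary 2.1.

Now, note that $2n^{-1}\tilde\rho(\theta_n, g_n)$ equals

$$\log IF_{\theta_n}(\hat\theta_n) + E_{\theta_n}\Big\{\log\Big(1 + \frac{c(X_n) - IF_{\theta_n}(\hat\theta_n)}{IF_{\theta_n}(\hat\theta_n)}\Big)\Big\} - \frac{\mathrm{Bias}_{\theta_n}(c(X_n))}{E_{\theta_n}(c(X_n))}$$

and using the inequality $\log(1 + x) \le x$ for all $x > -1$ on the second term on the right hand side it follows that

$$2n^{-1}\tilde\rho(\theta_n, g_n) \le \log IF_{\theta_n}(\hat\theta_n) + \frac{\mathrm{Bias}_{\theta_n}(c(X_n))}{IF_{\theta_n}(\hat\theta_n)} - \frac{\mathrm{Bias}_{\theta_n}(c(X_n))}{E_{\theta_n}(c(X_n))} = \log IF_{\theta_n}(\hat\theta_n) + \frac{\mathrm{Bias}^2_{\theta_n}(c(X_n))}{IF_{\theta_n}(\hat\theta_n)\, E_{\theta_n}(c(X_n))} = \log IF_{\theta_n}(\hat\theta_n) + B_{\theta_n}(\hat\theta_n, c_n).$$

Now, we write $2n^{-1}\{\rho(\theta_n, g_n) - \tilde\rho(\theta_n, g_n)\} = H_{\theta_n}(\hat\theta_n, c_n) + J_{\theta_n}(\hat\theta_n, c_n)$ where

$$H_{\theta_n}(\hat\theta_n, c_n) = IF_{\theta_n}(\hat\theta_n)\, \big\{E_{\theta_n}(c_n^{-1}) - (E_{\theta_n} c_n)^{-1}\big\}$$

and

$$J_{\theta_n}(\hat\theta_n, c_n) = (nr)^{-1} \cdot E_{\theta_n}\Big[\big(\|\hat\theta_n - \theta_n\|^2 - E_{\theta_n}\|\hat\theta_n - \theta_n\|^2\big)\, \frac{1}{c_n}\Big].$$

Note that the second factor in $H_{\theta_n}(\hat\theta_n, c_n)$ can be rewritten as

$$-(E_{\theta_n} c_n)^{-1}\, E_{\theta_n}\big[(c_n - E_{\theta_n} c_n)(c_n^{-1} - E_{\theta_n} c_n^{-1})\big]$$

which by the Cauchy-Schwarz (C-S) inequality has absolute value at most

$$(E_{\theta_n} c_n)^{-1}\, \big\{\mathrm{Var}_{\theta_n}(c_n) \times \mathrm{Var}_{\theta_n}(c_n^{-1})\big\}^{1/2}.$$

Thus, $|H_{\theta_n}(\hat\theta_n, c_n)| \le IF_{\theta_n}(\hat\theta_n)\, (E_{\theta_n} c_n)^{-1}\, SD_{\theta_n}(c_n) \times SD_{\theta_n}(c_n^{-1})$. Again, rewriting $J_{\theta_n}(\hat\theta_n, c_n)$ as

$$(nr)^{-1}\, E_{\theta_n}\big[\big(\|\hat\theta_n - \theta_n\|^2 - E_{\theta_n}\|\hat\theta_n - \theta_n\|^2\big)\big(c_n^{-1} - E_{\theta_n} c_n^{-1}\big)\big]$$

and applying the C-S inequality we get

$$|J_{\theta_n}(\hat\theta_n, c_n)| \le (nr)^{-1}\, SD_{\theta_n}(\|\hat\theta_n - \theta_n\|^2) \cdot SD_{\theta_n}(c_n^{-1}).$$

So $(n/2)^{-1} |\rho(\theta_n, g_n) - \tilde\rho(\theta_n, g_n)| \le |H_{\theta_n}(\hat\theta_n, c_n)| + |J_{\theta_n}(\hat\theta_n, c_n)| \le A_{\theta_n}(\hat\theta_n, c_n)$ and we have our desired result. □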

Corollary 2.1. If $U_n$ is an unbiased estimate of the parameter $q(\theta_n, \hat\theta_n)$ and $c_n = 1 + (nr)^{-1} U_n > 0$ almost surely, then we have

$$\rho_{\mathcal G}(\theta_n, \hat\theta_n) \le \rho(\theta_n, g[\hat\theta_n, c_n]) \le \frac{n}{2}\, E_{\theta_n}\big\{\log\big(1 + (nr)^{-1} U_n\big)\big\} + \frac{n}{2}\, A_{\theta_n}(\hat\theta_n, c_n).$$

The corollary follows from the above Lemma. The upper bound derived here involves the expectation of a statistic along with a distortion term $A_{\theta_n}(\hat\theta_n, c_n)$ which will be negligible under the RASL conditions. Ignoring it for the time being, we can say that an upper bound is produced when $\|\hat\theta_n - \theta_n\|^2$ in the lower bound of Lemma 2.5 can be replaced by a good statistic. Lemma 2.6 has an upper bound based on $IF_{\theta_n}(\hat\theta_n)$, and next we show that the lower bound and the upper bound are fairly close.

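Lemma 2.7. For any point estimate $\hat\theta_n$ and location parameter $\theta_n \in \mathbb{R}^n$ we have

$$\rho_{\mathcal G}(\theta_n, \hat\theta_n) \ge \frac{n}{2} \log IF_{\theta_n}(\hat\theta_n) - \frac{n}{2}\, L_{\theta_n}(\hat\theta_n)$$

where

$$L_{\theta_n}(\hat\theta_n) = (nr)^{-1} \cdot SD_{\theta_n}(\|\hat\theta_n - \theta_n\|^2) \cdot SD_{\theta_n}\big\{\big(1 + (nr)^{-1}\|\hat\theta_n - \theta_n\|^2\big)^{-1}\big\}.$$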

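Proof. From Lemma 2.5 we have

$$\log IF_{\theta_n}(\hat\theta_n) - 2n^{-1}\rho_{\mathcal G}(\theta_n, \hat\theta_n) \le \log IF_{\theta_n}(\hat\theta_n) - E_{\theta_n}\big\{\log\big(1 + (nr)^{-1}\|\hat\theta_n - \theta_n\|^2\big)\big\} = E_{\theta_n}\Big\{\log\Big(1 - \frac{l(\theta_n, \hat\theta_n)}{nr + \|\hat\theta_n - \theta_n\|^2}\Big)\Big\}$$

where $l(\theta_n, \hat\theta_n) = \|\hat\theta_n - \theta_n\|^2 - q(\theta_n, \hat\theta_n)$, and using Jensen's inequality and $\log(1 + x) \le x$ consecutively, the difference becomes

$$\le -E_{\theta_n}\Big[\frac{l(\theta_n, \hat\theta_n)}{nr + \|\hat\theta_n - \theta_n\|^2}\Big] = -E_{\theta_n}\Big[l(\theta_n, \hat\theta_n)\Big\{\frac{1}{nr + \|\hat\theta_n - \theta_n\|^2} - E_{\theta_n}\Big(\frac{1}{nr + \|\hat\theta_n - \theta_n\|^2}\Big)\Big\}\Big]$$

where the equality holds because $E_{\theta_n}\, l(\theta_n, \hat\theta_n) = 0$, and by applying the C-S inequality the magnitude of the said difference is

$$\le SD_{\theta_n}(\|\hat\theta_n - \theta_n\|^2) \times SD_{\theta_n}\big\{(nr + \|\hat\theta_n - \theta_n\|^2)^{-1}\big\} = L_{\theta_n}(\hat\theta_n).$$

This completes the proof. □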

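Corollary 2.2. Under the conditions of Lemma 2.6 we have

[i] $0 \le \rho(\theta_n, g[\hat\theta_n, c_n]) - \rho_{\mathcal G}(\theta_n, \hat\theta_n) \le \frac{n}{2}\big\{L_{\theta_n}(\hat\theta_n) + [A + B]_{\theta_n}(\hat\theta_n, c_n)\big\}$

[ii] $\big|\rho_{\mathcal G}(\theta_n, \hat\theta_n) - \frac{n}{2} \log IF_{\theta_n}(\hat\theta_n)\big| \le \frac{n}{2} \max\big\{L_{\theta_n}(\hat\theta_n),\ [A + B]_{\theta_n}(\hat\theta_n, c_n)\big\}.$

The corollary follows directly by combining the above lemma with Lemma 2.6. It bounds the deviation of the predictive risk from a continuous, increasing function of the MSE. The RASL conditions ensure the existence of at least one candidate for the statistic $U_n$ such that $c(X_n) > 0$ almost surely (this follows from RASL condition P3.3) and such that each of the associated terms $A_{\theta_n}(\hat\theta_n, c_n)$, $B_{\theta_n}(\hat\theta_n, c_n)$ and $L_{\theta_n}(\hat\theta_n)$ is of the order of $O(n^{-1})$. Hence, Theorem 1.1 follows.

Proof of Theorem 1.1. Note that under the RASL conditions $A_{\theta_n}(\hat\theta_n, c_n)$, $B_{\theta_n}(\hat\theta_n, c_n)$ and $L_{\theta_n}(\hat\theta_n)$ are of the order of $O(n^{-1})$. Also, note that the requirement $c > 0$ almost surely is taken care of by the RASL property P3.3. □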

2.4. Violation of the RASL conditions. Based on the lower bound of Lemma 2.5, and concentrating around the expectation by using Chebyshev's inequality, we have for any a in (0, 1)

MOM > 5 log (1 + j 1 1 

So if P1 of the RASL conditions is violated, i.e., for some $\theta'_n \in \mathbb{R}^n$ we have $q(\theta'_n, \hat\theta_n) > O(n)$, then if P2 holds or we have $\mathrm{Var}_{\theta'_n}(\|\hat\theta_n - \theta'_n\|^2) < q(\theta'_n, \hat\theta_n)$, then $\rho_{\mathcal G}(\theta'_n, \hat\theta_n) > \frac{n}{2}\log(1 + r^{-1})$, which is the minimax risk of the best invariant density estimate, and so the class of density estimates in $\mathcal G[1]$ centered around $\hat\theta$ does not have any minimax estimator. Thus, we can exclude bad point estimators in most conditions (also see Equation 17).

Among the cases where the RASL conditions do not hold, the only interesting case is when P2 is violated but P1 holds. In those cases the asymptotic predictive entropy risk cannot be characterized in closed form. An example of a point estimator of this kind is:



Si{Xi) ifi = l 
Xi if z = 2, • • • , n 



where the univariate point estimator 5i is given by 



n^/^(21ogn) X if x < (2 log n)^/^ 
X if X > (21ogn)-'^/^ 



2.5. Decision Theoretic implications. The asymptotic relation between the predictive risk and the mean square risk will help us in deriving oracle inequalities on the predictive risk of $\mathcal G_n$. The bounds will be sharp enough to discuss asymptotic optimality in the class $\mathcal G$. We first relate the class $\mathcal G$ to the other decision-theoretic classes of predictive densities. Then we compare the predictive risk of the respective classes in unrestricted parametric spaces.

In the above context, we consider the following 6 predictive estimates:

• $p_L$ : As a representative of the class of all linear predictive density estimates ($\mathcal L$) we choose the predictive density $g[X_n, 1 + r]$. It is the Bayes predictive density with respect to the uniform prior, has constant risk and is inadmissible in $\mathcal L$. It is the best invariant predictive strategy and is also minimax among all procedures (Liang and Barron, 2004).

• $p_E$ : We choose the James-Stein positive part plug-in predictive density estimate $g[\hat\theta^{JS+}, r]$ as a representative of $\mathcal P$. Though the positive part James-Stein estimator is inadmissible as a point estimate, it is difficult to find estimators that improve on it significantly. For all practical purposes the JS+ estimator can be considered a 'nearly' admissible point estimate. In that respect we can consider

$$p_E = g[\hat\theta^{JS+}, r] \quad\text{where}\quad \hat\theta^{JS+} = X_n \Big(1 - \frac{n-2}{\|X_n\|^2}\Big)_+$$

as an efficient representative from the class of plug-in predictive densities ($\mathcal P$). The subscript stands for the class of estimative (plug-in) distributions.

• $p_H$ : We consider the Bayes predictive density estimate from the harmonic prior $\pi_H$ as a representative of the class of all Bayes predictive density estimates. It is an admissible rule. As such, it also dominates $p_L$ (Ghosh, Mergel and Datta, 2008; Komaki, 2001).

• Next, we consider 3 members of $\mathcal G$ which we will use to compare the risk of the predictive densities from the above 3 classes.

  - $g[\hat\theta^{JS+}, 1 + r]$: A non-linear, fixed variance predictive density estimator around the JS+ estimator. It is uniformly better than $p_L$. It is also denoted by $g_M$.

  - $g[\hat\theta^{JS+}]$: The optimal member in $\mathcal G(\hat\theta^{JS+})$, which we will compare with $p_L$ and $p_E$.

  - $g[\hat\theta^{H}]$: The optimal member in $\mathcal G(\hat\theta^{H})$. We would like to compare its performance with $p_H$. Also, $g[\hat\theta^{H}]$ is asymptotically inadmissible among the procedures in $\mathcal G$.

In Table 1 we evaluate the predictive performance of each of these density
estimates on a dataset.
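For concreteness, the Kullback-Leibler loss of the three strategies that are available in closed form can be computed as below. This is a sketch under stated assumptions, not the evaluation code behind Table 1: it assumes the second argument of $g[\cdot, \cdot]$ in this list is the predictive variance (so $p_L = N(X_n, (1+r)I)$, $p_E = N(\hat\theta^{JS+}, rI)$ and $g_M = N(\hat\theta^{JS+}, (1+r)I)$), that the future density is $N(\theta_n, rI)$, and the helper names are hypothetical.

```python
import math


def gaussian_kl(theta, r, mean, variance):
    """KL divergence from N(theta, r I) to N(mean, variance I)."""
    n = len(theta)
    sq_err = sum((t - m) ** 2 for t, m in zip(theta, mean))
    ratio = r / variance
    return 0.5 * (n * (ratio - 1.0 - math.log(ratio)) + sq_err / variance)


def js_plus(x):
    """Positive-part James-Stein estimate (assumes n >= 3 and x != 0)."""
    n = len(x)
    shrink = max(0.0, 1.0 - (n - 2) / sum(v * v for v in x))
    return [shrink * v for v in x]


def strategy_losses(theta, x, r):
    """Loss of p_L, p_E and g_M for one observed past vector x."""
    shrunk = js_plus(x)
    return {
        "p_L": gaussian_kl(theta, r, x, 1.0 + r),
        "p_E": gaussian_kl(theta, r, shrunk, r),
        "g_M": gaussian_kl(theta, r, shrunk, 1.0 + r),
    }


# Toy input, purely illustrative: true mean at the origin, one past draw.
losses = strategy_losses([0.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], 1.0)
```

For the plug-in density the variance ratio is 1, so its loss collapses to $\|\hat\theta - \theta\|^2/(2r)$, which is why it deteriorates quickly as r gets small.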

Oracle inequalities and Implications. Lemma 1.1 describes the predictive risk of density estimates centered around $\hat\theta^{JS+}$.

Proof of Lemma 1.1. The result follows from Theorem 1.1 and by using Proposition 2.6 and Exercise 2.8 of Johnstone (2012). □

The lemma will not be useful in very very low signal-to-noise ratio. It 
can be used effectively when > 0(n~^). Note that, we can partition the 
improvement in the asymptotic prediction error over p^ in two parts. 

• We first shrink the location estimate while keeping the scale unperturbed and move to a better estimate $g[\hat\theta^{JS+}, 1 + r]$. Let the improvement be denoted by $d_1$.

• Now we optimize the scale keeping the location fixed and arrive at $g[\hat\theta^{JS+}]$. Let the improvement be denoted by $d_2$.

Based on the lemma we have

$$d_1 \sim \tilde a_n \quad\text{and}\quad d_2 \sim \log(1 - \tilde a_n)^{-1} \quad\text{where}\quad \tilde a_n = \{(1 + a_n)(1 + r)\}^{-1}.$$

As $\tilde a_n < 1$, $d_1$, $d_2$ as well as $d_2 - d_1$ are all positive and increasing in $\tilde a_n$. This means that we gain more by adapting the scale than by shifting the location, and the difference between the two gains is decreasing in both $a_n$ and r.
Prediction error for shrinkage estimators. By shrinkage point estimators we mean estimators of the form $s(X_n)\, X_n$ where $s(X_n)$ is an almost everywhere differentiable function. If $\|\theta_n\|^2$ were known, then spherically symmetric shrinkage estimators of the form $s(a_n)\, X_n$, where $a_n = \|\theta_n\|^2/n$ and $s(a_n) \le 1$, would be efficient. Let $\mathcal S$ denote the class of normal predictive densities based on such ideal point location estimators. Such an estimate satisfies the RASL condition P2, and so Lemma 2.7 can be used to calculate an optimal lower bound on the predictive risk of the family of density estimators based on $\mathcal S$, the class of all shrinkage point estimators conditioned on $a_n$.

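Note that by the bias-variance decomposition the quadratic risk of the ideal point estimator $s(a_n)\, X_n$ is given by

$$E_{\theta_n}\big(\|s_n X_n - \theta_n\|^2\big) = s_n^2\, n + \bar s_n^2\, \|\theta_n\|^2 \quad\text{where}\quad \bar s_n = 1 - s_n \ \text{and}\ s_n = s(a_n).$$

Based on Lemma A.3 we have $L_{\theta_n}(\hat\theta_n) \le (nr)^{-2}\, \mathrm{Var}_{\theta_n}(\|\hat\theta_n - \theta_n\|^2)$, and for an estimator in $\mathcal S$ we have

$$\mathrm{Var}_{\theta_n}\big(\|s_n X_n - \theta_n\|^2\big) = \mathrm{Var}_{\theta_n}\big(s_n^2 \|X_n - \theta_n\|^2 - 2 s_n \bar s_n \langle X_n - \theta_n, \theta_n\rangle\big)$$
$$\le 2\big[s_n^4\, \mathrm{Var}_{\theta_n}(\|X_n - \theta_n\|^2) + 4 s_n^2 \bar s_n^2\, \mathrm{Var}_{\theta_n}(\langle X_n - \theta_n, \theta_n\rangle)\big]$$
$$= 2 n s_n^2 \big[2 s_n^2 + 4 \bar s_n^2 a_n\big]$$

which is $O(n)$ if $a_n = O(1)$. Otherwise,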

P0i9n,sian)Xn) > 2"%^ log(| - 0„| | VM) 

> Eg^ log |s„||6'„|| - SnXn\/iVnr) oo as n CO 

and thus the optimal error in S is attained at IL(^„) as defined in Equa- 
tion (11). 

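Dimension independent bounds. Here, we produce a dimension independent bound on $g[\hat\theta_n^{JS}]$ by explicitly bounding $L_{\theta_n}(\hat\theta_n)$, $A_{\theta_n}(\hat\theta_n, c_n)$ and $B_{\theta_n}(\hat\theta_n, c_n)$ and then substituting them in Corollary 2.2. By construction $\{E(c_n)\}^{-1} \le 1$, and by using Lemma A.3 we have $SD_{\theta_n}(c_n^{-1}) \le (nr)^{-1}\, SD_{\theta_n}(U)$. Now, we have

(20) $n\, A_{\theta_n}(\hat\theta_n, c_n) \le r^{-2}\, n^{-1}\, IF_{\theta_n}(\hat\theta_n^{JS})\, \mathrm{Var}_{\theta_n}(U)$

(21) $\qquad\qquad +\ r^{-2}\, n^{-1}\, SD_{\theta_n}(\|\hat\theta_n - \theta_n\|^2)\, SD_{\theta_n}(U).$

The ideal flattening coefficient satisfies $IF_{\theta_n}(\hat\theta_n^{JS}) \le 1 + r^{-1}$, and the other quantities satisfy

(22) $n\, B_{\theta_n}(\hat\theta_n, c_n) \le r^{-2}\, n^{-1}\, \mathrm{Bias}^2(U_n)$

(23) $n\, L_{\theta_n}(\hat\theta_n) \le r^{-2}\, n^{-1}\, \mathrm{Var}_{\theta_n}(\|\hat\theta_n - \theta_n\|^2).$

For the JS estimator, each of the terms on the R.H.S. has an upper bound in terms of n and $a_n$. We can use the following crude upper bounds depending on n only, which will provide us a uniform bound over $\mathbb{R}^n$: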

Nature of shrinkage. The plug-in estimate $p_E$ performs better than $p_L$ and $g[\hat\theta^{JS+}, 1 + r]$ when $a_n$ is close to 0 but gets dominated with increasing values of $a_n$. And $g[\hat\theta^{JS+}]$ is asymptotically better than all three throughout. The relationship between $g[\hat\theta^{H}]$ and $p_H$ cannot be expressed explicitly. But following Brown, George and Xu (2008, Theorem 1) we can express the risk of $p_H$ as:

(24) $$\rho(\theta_n, p_H) = \frac{1}{2} \int_{(1 + r^{-1})^{-1}}^{1} v^{-1}\, q(\theta_n/\sqrt{v},\, \hat\theta^{H})\, dv$$

where $\hat\theta^{H}$ denotes the posterior mean of the harmonic prior. Equation (24) can be used to numerically evaluate the risk of $p_H$, as the risk of $\hat\theta^{H}$ has a closed form. The fact that these estimators are spherically symmetric will also help. We also get the following crude bound

$$C \inf_{\beta_n \in A(\theta_n)} q(\beta_n, \hat\theta^{H}) \le \rho(\theta_n, p_H) \le C \sup_{\beta_n \in A(\theta_n)} q(\beta_n, \hat\theta^{H})$$

where $A(\theta_n) = \{\beta_n = k\, \theta_n : 1 \le k \le \sqrt{1 + r^{-1}}\}$ and $C = \log(1 + r^{-1})/2$.

Minimaxity over Unrestricted Spaces. For any dimension n, $p_L$ is a minimax estimator. However, in dimensions greater than 2, $p_L$ is inadmissible and so there exist improved minimax estimators. $p_H$ is an improved minimax estimator over $p_L$ for $n \ge 3$. $g[\hat\theta^{JS+}]$ is also an asymptotically minimax estimator, with large improvements over $p_L$ which can be explicitly quantified. Using Theorem 1.1, asymptotically minimax predictive density estimates can be constructed around asymptotically minimax location estimates.

2.6. A Motivational Example from Sports Betting. Consider a game in which the outcomes depend on the actions of n players. Bets can be placed on a countable collection of (possibly overlapping) measurable sets $\mathcal A = \{A_i : i = 1, \dots, k\}$ with $k < \infty$ in $\mathbb{R}^n$. The maximum growth rate in such a betting market is given by



where P is the true probability distribution of the actions in the game, $\mathcal P(\mathbb{R}^n)$ is the set of all probability measures on $\mathbb{R}^n$ and k is the cardinality of the collection $\mathcal A$.

Assume initially that the collection is exhaustive, i.e., $\cup_j A_j = \mathbb{R}^n$. We can construct a mutually disjoint partition $\mathcal B = \{B_i : 1 \le i \le 2^k\}$ of the collection $\mathcal A$ where $B_i = \cap_{j=1}^{k} A_j^{i_j}$, $i_j$ is the $j$th term in the binary expansion of $i$, and for any set $A$, $A^0 = A^c$ and $A^1 = A$. We do not track null $B_i$ in $\mathcal B$ and ignore them throughout. Let $\kappa(B_i)$ denote the number of repetitions of the subset $B_i$ in the collection $\mathcal A$, i.e., $\kappa(B_i) = \mathrm{card}\{j : B_i \cap A_j \ne \emptyset \text{ and } j = 1, \dots, k\}$. Note that $\kappa(B_i) \in [1, k]$, and under finite overlaps we can assume that $\sup_i \kappa(B_i) = c < \infty$. We define a weight function on $\mathbb{R}^n$ as $w(x) = c^{-1} \sum_i \kappa(B_i)\, I_{B_i}(x)$. Note that $w(x) \in (0, 1]$ acts as a tilt function for the densities $p(x)$ and $q(x)$.

Theorem 2.1. If the probability measures P and Q have densities p and q with respect to Lebesgue measure, then for any countable collection of exhaustive measurable sets $\mathcal A$ we have

(25) $$\sum_{i=1}^{k} P(A_i) \log\{P(A_i)/Q(A_i)\} \le c \cdot D(p\|q)$$

where $D(p\|q) = \int p(x) \log\{p(x)/q(x)\}\, dx$ is the differential relative entropy between P and Q.

Proof. If the collection consists of mutually disjoint sets then the proof follows from the data processing inequalities associated with the quantization idea in information theory. The function $t \log t$ is strictly convex for $t > 0$. So for any positive random variable T and any sigma-finite measure $\mu$, by Jensen's inequality we have $E_\mu(T \log T) \ge E_\mu(T) \log E_\mu(T)$. For any measurable set A, with $T(x) = p(x)/q(x)$ and measure $\mu(dx) = q(x)/Q(A)\, dx$ on A, we have $P(A) \log\{P(A)/Q(A)\} \le \int_A p(x) \log\{p(x)/q(x)\}\, dx$, and so the proof extends to mutually exclusive cases.

If the events are not mutually disjoint then we can construct the mutually disjoint partition $\mathcal B = \{B_i : 1 \le i \le 2^k\}$ as above, and using the log-sum inequality (Cover and Thomas, 1991, Theorem 2.7.1) separately on each $A_i$ we have

$$\sum_{i=1}^{k} P(A_i) \log\{P(A_i)/Q(A_i)\} \le \sum_{i=1}^{2^k} \kappa(B_i)\, P(B_i) \log\{P(B_i)/Q(B_i)\}$$

and again using the above quantization argument we can show that the R.H.S. above is less than $c \int w(x)\, p(x) \log\{p(x)/q(x)\}\, dx = c\, D(w.p\|w.q)$. Now, observe that

$$D(w.p\|w.q) - D(p\|q) = \int (1 - w(x))\, p(x) \log\{q(x)/p(x)\}\, dx \le \log \int [1 - w(x)]\, q(x)\, dx$$

by Jensen's inequality, and the result follows as $\int [1 - w(x)]\, q(x)\, dx \le 1$. □

If the collection is not exhaustive we can restrict our densities to the 
corresponding subsets of M". 

2.6.1. An illustration with a Dataset. We consider the Baseball data that was used to show the advantage of shrinking location estimates in Efron and Morris (1977). The dataset consists of 18 players (so n = 18, which is not very high dimensional) with exactly 45 at-bats on a particular date during the 1970 season. The objective is to predict the performance of the players over the remainder of the season.

The number of hits (H) and the number of at-bats (N) over two portions of the season satisfy

$$H_{ji} \sim \mathrm{Binomial}(N_{ji}, p_i), \quad j = 1, 2;\ i = 1, \dots, n,$$

where j = 1 denotes past data and j = 2 represents the unknown future. As the variance of the Binomial model depends on the mean parameter $p_i$, a variance stabilization transformation (Brown, 2008) is conducted (which goes through as the $N_{ji}$ are quite large). The transformation

(26) $$X_{ji} = \arcsin\sqrt{\frac{H_{ji} + 1/4}{N_{ji} + 1/2}}$$

reduces the binomial model to the normal model

(27) $$X_{ji} \sim N(\theta_i, \sigma^2_{ji}) \quad\text{where}\quad \theta_i = \arcsin\sqrt{p_i},\ \ \sigma^2_{ji} = (4N_{ji})^{-1}$$
r      $p_E$     $p_L$     $g_M$     $g[\hat\theta^{JS+}]$   $g[\hat\theta^{H}]$   $p_H$
0.1    22.963    19.451    15.487    11.435                  19.232                19.578
0.2    11.482    14.174    10.539     7.418                  13.982                14.289
0.5     4.593     8.326     5.418     3.717                   8.188                 8.424
1       2.296     5.067     2.886     2.047                   4.975                 5.142
2       1.148     2.868     1.415     1.081                   2.815                 2.924
5       0.459     1.250     0.524     0.448                   1.227                 1.286
10      0.23      0.645     0.248     0.227                   0.633                 0.614

Table 1
Predictive loss of different Gaussian strategies on the Baseball data.

and the $X_{ji}$ are independent for $1 \le i \le n$. With the past $P = X_1$ and the future $F = X_2$, we have the following predictive set-up:

(28) $F \mid \theta_n \sim N(\theta_n, v_f I_n)$ ; $P \mid \theta_n \sim N(\theta_n, v_p I_n).$
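The preprocessing step can be sketched as follows. The exact form of the transformation is an assumption here: the sketch uses the arcsine transform of Brown (2008), $X = \arcsin\sqrt{(H + 1/4)/(N + 1/2)}$ with approximate variance $1/(4N)$, and takes $r$ to be the ratio of future to past variance. The function names and the toy counts are hypothetical and are not taken from the data.

```python
import math


def arcsine_transform(hits, at_bats):
    """Variance-stabilized value and its approximate variance 1 / (4 N)."""
    x = math.asin(math.sqrt((hits + 0.25) / (at_bats + 0.5)))
    return x, 1.0 / (4.0 * at_bats)


def variance_ratio(past_at_bats, future_at_bats):
    """r = v_f / v_p = N_past / N_future under the 1 / (4 N) approximation."""
    v_past = 1.0 / (4.0 * past_at_bats)
    v_future = 1.0 / (4.0 * future_at_bats)
    return v_future / v_past


# Toy counts, purely illustrative.
x_past, v_past = arcsine_transform(hits=18, at_bats=45)
r = variance_ratio(past_at_bats=45, future_at_bats=450)
```

With 45 past at-bats, a future block roughly ten times as long gives r close to 0.1, which matches the smallest value of r used in Table 1.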

We want joint predictive densities of the future performances of players in this standardized model. We use a very naive evaluation strategy by considering the entire season's batting average as the true parametric value. Over the entire season the players ended up playing around 400 games on average. So, evaluating the predictive densities at $\theta_i^{full} = \arcsin\sqrt{p_i^{full}}$, where the $p_i^{full}$ are the batting averages from the entire season, will not be terrible. Evaluation procedures with guarantees may be developed in a sequential set-up (Lai, Gross and Shen, 2011).

While using shrinkage on the location estimators we shrink towards the grand average. We evaluate the 6 different predictive strategies of Section 2.5 for different values of r, the ratio of future to past variability. The value of r will be close to 0.1 when we consider prediction on the entire remaining half of the season.

We find that for any choice of r, $g[\hat\theta^{JS+}]$ is the best among the 6 estimators considered. Also, $d_2 - d_1$ (as discussed in Section 2.5) is decreasing in r. $p_E$ behaves well when r is large and poorly for small values. The losses for $p_H$ and $g[\hat\theta^{H}]$ are very similar.

3. Restricted minimax predictive risk of $\mathcal G[p]$.

A typical member in the class $\mathcal G[p]$ of all product Gaussian predictive densities is represented by $g[\hat\theta_n, D_n] = \prod_{i=1}^{n} N(\hat\theta(i),\, d(i)\, \sigma_f^2)$. Generalizing the argument in Lemma 2.5 we see that a lower bound on the minimum predictive risk $\rho_p(\theta_n, \hat\theta_n)$ of all density estimates in $\mathcal G_n[p]$ that have mean $\hat\theta_n$ is given by

(29) $$\rho_p(\theta_n, \hat\theta_n) \ge \frac{1}{2} \sum_{i=1}^{n} E_{\theta(i)}\big\{\log\big(1 + (\hat\theta(i) - \theta(i))^2\big)\big\}.$$

The predictive risk of the estimate g[9n-, Dn] is given by 
(30) 

i=i i=i ^ J 

It is not necessarily true that

$$\rho_p(\theta_n, \hat\theta_n) = \min_{D_n} \rho(\theta_n, g[\hat\theta_n, D_n])$$

asymptotically equals the lower bound given in Equation (29). In the previous section we saw that under sufficient regularity conditions these bounds match. The ideas there can be extended to block-wise estimators and to non-orthogonal models by using the concept of Mallows' unbiased risk estimates. In the $\ell_0$ sparse predictive space, as the degree of sparsity tends to zero, i.e., $s/n \to 0$ as $n \to \infty$, the lower bound given in Equation (29) is significantly greater than the minimax predictive risk over $\mathcal G[p]$. And so, the procedure used in the previous section cannot be used for finding the asymptotic minimax predictive Gaussian risk over $\Theta(n, s)$.

Minimax predictive risk over sparse parameter spaces. Here we outline the proof of Theorem 1.3. Following the Bayes-minimax procedure of Johnstone (2012, Chapter 4.4), the multivariate minimax problem can be reduced to a univariate minimax problem with moment prior constraints

$$\mathfrak m(\eta) = \{\pi \in \mathcal P(\mathbb{R}) : \pi(0) \ge 1 - \eta\}$$

where $\mathcal P(\mathbb{R})$ is the collection of all probability measures on $\mathbb{R}$. In Theorem 1.1 of Mukherjee and Johnstone (2012) we have the univariate minimax risk

$$\min_{p} \max_{\pi \in \mathfrak m(\eta)} \int \rho(\theta, p)\, \pi(\theta)\, d\theta \sim (1 + r)^{-1}\, \eta \log \eta^{-1} \quad\text{as } \eta \to 0.$$

When restricted to the Gaussian family the minimax risk will be

$$\min_{p \in \mathcal G} \max_{\pi \in \mathfrak m(\eta)} \int \rho(\theta, p)\, \pi(\theta)\, d\theta \sim f(\eta) \quad\text{as } \eta \to 0$$

where $f(\eta) = \eta \log \eta^{-1}$. In this univariate asymptotic set-up the lower bound in Equation (29) is much lower than the asymptotic rate $\eta \log \eta^{-1}$ and hence unusable. We get an upper bound on the minimax Gaussian risk, as from point estimation theory (Donoho and Johnstone, 1994) it follows that the minimax plug-in risk in this asymptotic set-up is $f(\eta)$. For a lower bound consider the predictive risk of the normal density estimate $g[\hat\theta, d]$

(31) $$\rho(\theta, g[\hat\theta, d]) = E_\theta(\log d) + E_\theta\big\{d^{-1} \cdot (1 + (\hat\theta - \theta)^2) - 1\big\}.$$

The idea is to establish the necessity of a threshold zone as done in Johnstone and Silverman (2004). For $\rho(0, g[\hat\theta, d])$, the predictive risk of $g[\hat\theta, d]$ at the origin, to be lower than the order of $\eta$ we need a threshold size of at least $\lambda(\eta) = \sqrt{2 \log \eta^{-1}}$. And for density estimators of the form

(32) $$p[\lambda(\eta)](\cdot \mid X) = \begin{cases} N(0, \sigma_f^2) & \text{if } |X| \le \lambda(\eta) \\ N(\hat\theta(X),\, d(X)\, \sigma_f^2) & \text{if } |X| > \lambda(\eta) \end{cases}$$

the supremum predictive risk at the non-zero support points is $f(\eta)$, i.e.,

(33) $$\sup_\theta \rho(\theta, p[\lambda(\eta)]) \sim f(\eta) \quad\text{as } \eta \to 0.$$

Thus, it follows that sup-optimality of the class Q\p\ is 1 + r^^. 
APPENDIX A

Lemma A.1. If $Y_n$ is a sequence of random variables such that $Y_n \sim \chi^2_n(\lambda_n)$ for a non-negative and increasing sequence $\{\lambda_n : n \ge 1\}$, then for $n \ge 5$ we have

$$\mathrm{Var}(Y_n^{-1}) \le k_1(n) \cdot n^{-3} \quad\text{where}\quad k_1(n) = 3\, (1 - 2/n)^{-2} (1 - 4/n)^{-1}.$$

Proof. We observe that $Y_n$, being a non-central chi-square random variable, can be written as a Poisson mixture of central chi-square random variables:

$$Y_n \sim \chi^2_{n + 2N_n} \quad\text{where}\quad N_n \sim \mathrm{Poisson}(\lambda_n/2).$$

Decomposing the variance by conditioning on the Poisson random variable we have

$$\mathrm{Var}(Y_n^{-1}) = \mathrm{Var}_{\lambda_n}\big\{E(Y_n^{-1} \mid N_n)\big\} + E_{\lambda_n}\big\{\mathrm{Var}(Y_n^{-1} \mid N_n)\big\} = \mathrm{Var}_{\lambda_n}\Big(\frac{1}{n + 2N_n - 2}\Big) + E_{\lambda_n}\Big\{\frac{2}{(n + 2N_n - 2)^2 (n + 2N_n - 4)}\Big\}$$

which follows from the moments of the central chi-square (gamma) distribution. As $N_n \ge 0$, the second term on the R.H.S. is $\le 2(n - 2)^{-2}(n - 4)^{-1}$, and by Lemma A.3 we have

$$(n - 2)^2\, \mathrm{Var}_{\lambda_n}\Big(\frac{1}{n + 2N_n - 2}\Big) = \mathrm{Var}_{\lambda_n}\Big(\frac{1}{1 + 2N_n/(n - 2)}\Big) \le \{1 + 2E(N_n)/(n - 2)\}^{-2}\, \mathrm{Var}\Big(\frac{2N_n}{n - 2}\Big) = \frac{2\lambda_n}{(n - 2 + \lambda_n)^2} \le \frac{1}{2(n - 2)}.$$

Thus, $\mathrm{Var}(Y_n^{-1}) \le 3\, (n - 2)^{-2} (n - 4)^{-1}$. □

Lemma A.2. If $Y_n \sim \chi^2_n(\lambda_n)$ and $\lambda_n$ is an increasing sequence then
$\lambda_n^2\, P(Y_n < n - 2) \le O(n)$.

Proof. Holds trivially for A„ < 0{^/n). So we will prove for all other 
sequences i.e sequence where \nl\fn is not bounded. Note that P{Yn < 
n — 2) < P{Yn < n). And as is a non-central chi-square we have 

Yn = Vn+2N where N = Poisson(A„) and Vn = x^(0) 
Now, for any fixed n and N we have, 

P{yn+2N ^ n) < 2P{Vm+2N ^ TTi) for all m > n such that m — n is large. 
Because P{Vm+2N < m\Vn+2N < n) < -P(Xm_„(0) < m - n) < 1/2. So, 



lim P{Yn <n)= lim ExJP{ Vn+2N < n 



N 



< 2 lim ExA lim P( K+27V < n 



-2N 

2 lim <i $ ( , 

n^oo " I \ V2n m 



N 
as we can do a normal approximation to the sequence of central chi-square random variables. Next, we interchange the integrals (by Fubini's theorem, as the integrand is positive) and then use the bounded convergence theorem to have,



lim P{Yn < n) < 2 lim / <^(z)Pa„ 

n— >oo 71— >oo J 

= 2 [ ct>{z) lim Pa„ 



2N , , 

< z ] dz 



V2n + 4N 

2N 
V2n + m 



< z ] dz 



Now for all large n, A„ is large (as increasing and A„/ y/n is not a bounded 
sequence). So each large n, we can separately do a normal approximation to 
the Poisson random variable A^. 

Consider the case first when A„ > 0{n). In this case the following naive 
bound will work: 



V2n + 4iV ~ J ~ " \ - V\/A 



We will use this bound for all z such that z'^ ^ where tn equals n ^(A^j — 
V%J^V^log"^rr+~21ogn) . Also note that, 

A2$ f " ) < 0(n) for ah < t„ and $(t„) = 0(nA~2). 



And so, it follows that A„ lim„_j.oo -P(^ ^ n) < 0{n). □ 
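Lemma A.3. For any non-negative random variable Y,

$$\mathrm{Var}\{(1 + Y)^{-1}\} \le \{1 + E(Y)\}^{-2}\, \mathrm{Var}(Y).$$

Proof. As Y is non-negative we have

$$\Big(\frac{1}{1 + Y} - \frac{1}{1 + E(Y)}\Big)^2 = \frac{(Y - E(Y))^2}{(1 + Y)^2 (1 + E(Y))^2} \le \frac{(Y - E(Y))^2}{(1 + E(Y))^2}.$$

Now, taking expectation on both sides and using the bias-variance decomposition we get

$$\mathrm{Var}\{(1 + Y)^{-1}\} \le E\Big(\frac{1}{1 + Y} - \frac{1}{1 + E(Y)}\Big)^2 \le \{1 + E(Y)\}^{-2}\, \mathrm{Var}(Y).$$

This completes the proof. □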
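The inequality of Lemma A.3 can be verified exactly on a small discrete distribution. The sketch below is illustrative only; the distribution and the helper names are arbitrary choices, and rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction


def mean(values, probs):
    """Expectation of a discrete random variable."""
    return sum(p * v for v, p in zip(values, probs))


def variance(values, probs):
    """Variance of a discrete random variable."""
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))


def lemma_a3_sides(values, probs):
    """Return (Var{1/(1+Y)}, Var(Y) / (1 + E Y)^2) for a discrete Y >= 0."""
    inverse = [1 / (1 + v) for v in values]
    lhs = variance(inverse, probs)
    rhs = variance(values, probs) / (1 + mean(values, probs)) ** 2
    return lhs, rhs


# Y takes the values 0, 1, 4 with probabilities 1/2, 1/4, 1/4.
values = [Fraction(0), Fraction(1), Fraction(4)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
lhs, rhs = lemma_a3_sides(values, probs)
```

For this distribution the left side is 187/1600 and the right side is 43/81, so the bound holds with room to spare.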
Lemma A.4. For any random variable X we have $\mathrm{Var}(X_+) \le \mathrm{Var}(X)$.

Proof. With the decomposition $X = X_+ - X_-$ we have

$$\mathrm{Var}(X) = E(X^2) - E^2(X)$$
$$= E(X_+^2) + E(X_-^2) - E^2(X_+) - E^2(X_-) + 2E(X_+)\, E(X_-)$$
$$= \mathrm{Var}(X_+) + \mathrm{Var}(X_-) + 2E(X_+)\, E(X_-)$$

and we get the stated result as all the terms on the R.H.S. are non-negative. □
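The decomposition used in the proof of Lemma A.4 can also be checked exactly on a toy distribution. This is a sketch with an arbitrarily chosen discrete X and hypothetical helper names; it relies on $X_+ X_- = 0$, which is what removes the cross term from $E(X^2)$.

```python
from fractions import Fraction


def mean(values, probs):
    """Expectation of a discrete random variable."""
    return sum(p * v for v, p in zip(values, probs))


def variance(values, probs):
    """Variance of a discrete random variable."""
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))


# X takes the values -2, 0, 3 with probabilities 1/2, 1/4, 1/4.
values = [Fraction(-2), Fraction(0), Fraction(3)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

positive_part = [max(v, Fraction(0)) for v in values]
negative_part = [max(-v, Fraction(0)) for v in values]

var_x = variance(values, probs)
var_plus = variance(positive_part, probs)
var_minus = variance(negative_part, probs)
cross = 2 * mean(positive_part, probs) * mean(negative_part, probs)
```

Here `var_x` equals `var_plus + var_minus + cross`, and since the last two terms are non-negative the positive part has the smaller variance.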